The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:
523,288,816 cache-references (architectural event: LLC Reference)
205,331,370 cache-misses (architectural event: LLC Misses)
237,794,728 L1-dcache-load-misses L1D.REPLACEMENT
3,495,080,007 L1-dcache-loads MEM_INST_RETIRED.ALL_LOADS
2,039,344,725 L1-dcache-stores MEM_INST_RETIRED.ALL_STORES
531,452,853 L1-icache-load-misses ICACHE_64B.IFTAG_MISS
77,062,627 LLC-loads OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
27,462,249 LLC-load-misses OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
15,039,473 LLC-stores OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
3,829,429 LLC-store-misses OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)
All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see "Hardware cache events and perf" and "How does perf use the offcore events?".
But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn’t it be equal to
LLC-loads-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.
LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)
cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events count only core-originating requests. They include requests from uops irrespective of whether those uops end up retiring and irrespective of the source of the response. It’s unclear to me how a prefetch promoted to demand is counted.
Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses, and cache-references is always larger than LLC-loads + LLC-stores.
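As a quick sanity check, the two inequalities can be tested against the counts from the listing at the top of this answer. This is only a sketch: the `counts` dictionary and the `demand_gap` helper are names made up for illustration, and the "remainder" is simply whatever the architectural event counts beyond the demand loads and stores (by the explanation above, mostly prefetch and code fetch requests), not a measured breakdown.

```python
# Counts copied from the perf stat listing in this answer.
counts = {
    "cache-references": 523_288_816,
    "cache-misses": 205_331_370,
    "LLC-loads": 77_062_627,
    "LLC-load-misses": 27_462_249,
    "LLC-stores": 15_039_473,
    "LLC-store-misses": 3_829_429,
}


def demand_gap(total_event, load_event, store_event, c=counts):
    """Hypothetical helper: return (demand_sum, remainder).

    demand_sum is load_event + store_event; remainder is what
    total_event counts on top of that sum.
    """
    demand_sum = c[load_event] + c[store_event]
    return demand_sum, c[total_event] - demand_sum


miss_sum, miss_gap = demand_gap(
    "cache-misses", "LLC-load-misses", "LLC-store-misses"
)
ref_sum, ref_gap = demand_gap("cache-references", "LLC-loads", "LLC-stores")

print(f"LLC-load-misses + LLC-store-misses = {miss_sum:,}")
print(f"cache-misses exceeds that sum by     {miss_gap:,}")
print(f"LLC-loads + LLC-stores             = {ref_sum:,}")
print(f"cache-references exceeds that by     {ref_gap:,}")
```

For these numbers both remainders are positive, which is consistent with the claim, although a single run obviously cannot prove that it always holds.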
The same confusion goes to cache-reference. It is much lower than
L1-dcache-loads and much higher then LLC-loads+LLC-stores
It’s only guaranteed that cache-references is at least as large as cache-misses because the former counts requests irrespective of whether they miss the L3. It’s normal for L1-dcache-loads to be larger than cache-references because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it’s not necessarily always the case because of hardware prefetches.
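To put numbers on this, here is a small sketch using the counts from the listing at the top of this answer. The variable names are made up for illustration, and the two ratios are just convenient summaries of this one run, not anything perf reports itself.

```python
# Counts copied from the perf stat listing in this answer.
counts = {
    "L1-dcache-loads": 3_495_080_007,
    "cache-references": 523_288_816,
    "cache-misses": 205_331_370,
}

# Expected to hold on every run: references include the misses.
refs_cover_misses = counts["cache-references"] >= counts["cache-misses"]

# Typical, but not guaranteed (hardware prefetches can change it).
loads_exceed_refs = counts["L1-dcache-loads"] > counts["cache-references"]

# Fraction of L3 references that missed, and L3 references per L1D load.
l3_miss_ratio = counts["cache-misses"] / counts["cache-references"]
refs_per_load = counts["cache-references"] / counts["L1-dcache-loads"]

print(f"cache-references >= cache-misses:   {refs_cover_misses}")
print(f"L1-dcache-loads > cache-references: {loads_exceed_refs}")
print(f"L3 miss ratio:                      {l3_miss_ratio:.3f}")
print(f"L3 references per L1D load:         {refs_per_load:.3f}")
```

In this run roughly 39% of the L3 references miss, and there is about one L3 reference for every six or seven L1D loads.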
The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.
No, it’s a trap. They are not easy to understand.