How does Linux perf calculate the cache-references and cache-misses events?

The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:

  523,288,816      cache-references        (architectural event: LLC Reference)                             
  205,331,370      cache-misses            (architectural event: LLC Misses) 
  237,794,728      L1-dcache-load-misses   L1D.REPLACEMENT
3,495,080,007      L1-dcache-loads         MEM_INST_RETIRED.ALL_LOADS
2,039,344,725      L1-dcache-stores        MEM_INST_RETIRED.ALL_STORES                     
  531,452,853      L1-icache-load-misses   ICACHE_64B.IFTAG_MISS
   77,062,627      LLC-loads               OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
   27,462,249      LLC-load-misses         OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
   15,039,473      LLC-stores              OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
    3,829,429      LLC-store-misses        OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)

All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see "Hardware cache events and perf" and "How does perf use the offcore events?".
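As a rough illustration of how the bit lists in the OFFCORE_RESPONSE rows turn into a single MSR value, the sketch below ORs the listed bit positions together. The helper name `offcore_mask` is hypothetical, and the code only does the bit arithmetic; it does not claim anything about what each bit means, which is what the Intel manual is for.

```python
def offcore_mask(*bits):
    """OR together single bit positions and inclusive (low, high) ranges."""
    mask = 0
    for item in bits:
        # A bare int is one bit; a tuple is an inclusive range of bits.
        low, high = item if isinstance(item, tuple) else (item, item)
        for position in range(low, high + 1):
            mask |= 1 << position
    return mask


# Bit lists copied from the table above.
masks = {
    "LLC-loads": offcore_mask(0, 16, (30, 37)),
    "LLC-load-misses": offcore_mask(0, 17, (26, 29), (30, 37)),
    "LLC-stores": offcore_mask(1, 16, (30, 37)),
    "LLC-store-misses": offcore_mask(1, 17, (26, 29), (30, 37)),
}

for event, mask in masks.items():
    print(f"{event:18s} {mask:#x}")
```

The load and store variants differ only in the low request-type bit (0 versus 1), and each miss variant swaps bit 16 for bit 17 and adds bits 26-29.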

But how does perf calculate the cache-misses event? From my understanding,
if cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn’t it be equal to
LLC-load-misses + LLC-store-misses? Clearly in my case,
cache-misses is much higher than the Last-Level-Cache-Misses number.

LLC-load-misses and LLC-store-misses count only cacheable data read requests and RFO requests, respectively, that miss in the L3 cache. LLC-load-misses also includes reads for page walking. Both exclude hardware and software prefetching. (The difference compared to Haswell is that some types of prefetch requests are counted.)

cache-misses also includes prefetch requests and code fetch requests that miss in the L3 cache. All of these events only count core-originating requests. They include requests from uops irrespective of whether they end up retiring and irrespective of the source of the response. It’s unclear to me how a prefetch promoted to demand is counted.

Overall, I think cache-misses is always larger than LLC-load-misses + LLC-store-misses and cache-references is always larger than LLC-loads + LLC-stores.
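To make the comparison concrete, here is a small sketch that plugs in the counts from the table above. It only checks this one run, so it illustrates the relationship rather than proving that it always holds.

```python
# Counts copied from the perf output in the table above.
counts = {
    "cache-references": 523_288_816,
    "cache-misses": 205_331_370,
    "LLC-loads": 77_062_627,
    "LLC-load-misses": 27_462_249,
    "LLC-stores": 15_039_473,
    "LLC-store-misses": 3_829_429,
}

# Demand-only sums, which exclude prefetches and code fetches.
llc_miss_sum = counts["LLC-load-misses"] + counts["LLC-store-misses"]
llc_ref_sum = counts["LLC-loads"] + counts["LLC-stores"]

print(f"cache-misses     = {counts['cache-misses']:>13,}")
print(f"LLC miss sum     = {llc_miss_sum:>13,}")
print(f"cache-references = {counts['cache-references']:>13,}")
print(f"LLC access sum   = {llc_ref_sum:>13,}")

# The gap is the part not covered by the demand-only LLC-* events.
print(f"misses not in LLC-* events:     {counts['cache-misses'] - llc_miss_sum:,}")
print(f"references not in LLC-* events: {counts['cache-references'] - llc_ref_sum:,}")
```

In this run the demand-only sums account for well under half of cache-misses and cache-references, which is consistent with prefetch and code fetch requests making up the rest.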

The same confusion goes to cache-references. It is much lower than
L1-dcache-loads and much higher than LLC-loads + LLC-stores.

It’s only guaranteed that cache-references is larger than cache-misses because the former counts requests irrespective of whether they miss the L3. It’s normal for L1-dcache-loads to be larger than cache-references because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it’s not necessarily always the case because of hardware prefetches.

The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.

No, it’s a trap. They are not easy to understand.
