How does the CPU cache affect the performance of a C program

Question

The plots show the combination of several complex low-level effects (mainly cache trashing & prefetching issues). I assume the target platform is a mainstream modern processor with cache lines of 64 bytes (typically a x86 one).

I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:

First of all, when nj is small, the gap between fetched address (ie. strides) is small and cache lines are relatively efficiently used. For example, when nj = 1, the access is contiguous. In this case, the processor can efficiently prefetch the cache lines from the DRAM so to hide its high latency. There is also a good spatial cache locality since many contiguous items share the same cache line. When nj=2, only half the value of a cache line is used. This means the number of requested cache line is twice bigger for the same number of operations. That being said the time is not much bigger due to the relatively high latency of adding two floating-point numbers resulting in a compute-bound code. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel. Note that most processors can also load multiple values from the cache per cycle. When nj = 4 a new cache line is requested every 2 cycles (since a double takes 8 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One may expect the time to be stable for nj >= 8 since the number of requested cache line should be the same, but in practice processors prefetch multiple contiguous cache lines so not to pay the overhead of the DRAM latency which is huge in this case. The number of prefetched cache lines is generally between 2 to 4 (AFAIK such prefetching strategy is disabled on Intel processors when the stride is bigger than 512, so when nj >= 64. This explains why the timings are sharply increasing when nj < 32 and they become relatively stable with 32 <= nj <= 256 with exceptions for peaks.

The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern cache are N-way associative with N typically between 4 and 16. For example, here are statistics on my i5-9600KF processors:

Cache 0: L1 data cache,        line size 64,  8-ways,    64 sets, size 32k 
Cache 1: L1 instruction cache, line size 64,  8-ways,    64 sets, size 32k 
Cache 2: L2 unified cache,     line size 64,  4-ways,  1024 sets, size 256k 
Cache 3: L3 unified cache,     line size 64, 12-ways, 12288 sets, size 9216k

This means that two fetched values from the DRAM with the respective address A1 and A2 can results in conflicts in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In this case, the processor needs to choose which cache line to replace from a set of N=8 cache lines. There are many cache replacement policy and none is perfect. Thus, some useful cache line are sometime evicted too early resulting in additional cache misses required later. In pathological cases, many DRAM locations can compete for the same cache lines resulting in excessive cache misses. More information about this can be found also in this post.

Regarding the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulus the cache size, then only N cache lines (ie. 8 for my processor) can actually be used to store all the values. Having less cache lines available is a big problem since the prefetcher need a pretty large space in the cache so to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower memory throughput. This is especially true here since the latency of fetching 1 cache line from the DRAM is about several dozens of nanoseconds (eg. ~70 ns) while its bandwidth is about dozens of GiB/s (eg. ~40 GiB/s): dozens of cache lines (eg. ~40) should be fetched concurrently so to hide the latency and saturate the DRAM.

Here is the simulation of the number of cache lines that can be actually used in my L1 cache regarding the value of the nj:

 nj  #cache-lines
  1      512
  2      512
  3      512
  4      512
  5      512
  6      512
  7      512
  8      512
  9      512
 10      512
 11      512
 12      512
 13      512
 14      512
 15      512
 16      256    <----
 17      512
 18      512
 19      512
 20      512
 21      512
 22      512
 23      512
 24      512
 25      512
 26      512
 27      512
 28      512
 29      512
 30      512
 31      512
 32      128    <----
 33      512
 34      512
 35      512
 36      512
 37      512
 38      512
 39      512
 40      512
 41      512
 42      512
 43      512
 44      512
 45      512
 46      512
 47      512
 48      256    <----
 49      512
 50      512
 51      512
 52      512
 53      512
 54      512
 55      512
 56      512
 57      512
 58      512
 59      512
 60      512
 61      512
 62      512
 63      512
 64       64    <----
==============
 80      256
 96      128
112      256
128       32
144      256
160      128
176      256
192       64
208      256
224      128
240      256
256       16
384       32
512        8
1024       4

We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetecher will preload data into cache lines that are likely evicted early by subsequent fetched (done concurrently). Loads instruction performed in the code are more likely to result in cache misses when the number of available cache line is small. When a cache miss happen, the value need then to be fetched again from the L2 or even the L3 resulting in a slower execution. Note that the L2 cache is also subject to the same effect though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distributes things to reduce collisions from fixed strides (at least on Intel processors and certainly on AMD too though AFAIK this is not documented).

Here is the timings on my machine for some peaks:

  32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
  48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
  64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
  96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
 128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
 192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
 256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
 384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
 512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03

As expected, the timings are overall bigger in practice for the case where the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising since they are significantly faster than others. This is the case where the number of available cache lines is equal to the number of ways of associativity (N). My guess is that this is because Intel processors certainly detect this pathological case and optimize the prefetching so to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache — see below).

Finally, for large nj stride, a bigger nj should results in higher overheads mainly due to the translation lookaside buffer (TLB): there are more page addresses to translate with bigger nj and the number of TLB entries is limited. In fact this is what I can observe on my machine: timings tends to slowly increase in a very stable way unlike on your target platform.

I cannot really explain this very strange behavior yet.
Here is some wild guesses:

The OS could tend to uses more huge pages when nj is large (so to reduce de overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher as AFAIK it cannot cross page
boundaries. You can try to check the number of allocated (transparent) huge-pages (by looking AnonHugePages in /proc/meminfo in Linux) or force them to be used in this case (using an explicit memmap), or possibly by disabling them. My system appears to make use of 2 MiB transparent huge-pages independently of the nj value.
If the target architecture is a NUMA one (eg. new AMD processors or a server with multiple processors having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance due to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux so to force local allocations.

For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 cache works in practice is available here.

Removing the peaks

To remove the peaks due to cache trashing on x86 processors, you can use non-temporal software prefetching instructions so cache lines can be fetched in a non-temporal cache structure and into a location close to the processor that should not cause cache trashing in the L1 (if possible). Such cache structure is typically a line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code that also include a loop unroling optimization to speed up the code when nj is small:

double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
    double sum0 = 0.0;
    double sum1 = 0.0;
    double sum2 = 0.0;
    double sum3 = 0.0;

    if(nj % 16 == 0)
    {
        // Cache-bypassing prefetch to avoid cache trashing
        const size_t distance = 12;
        for (uint64_t i = 0; i < ni; ++i) {
            _mm_prefetch(&data[(i+distance)*nj+0], _MM_HINT_NTA);
            sum0 += data[i*nj+0];
        }
    }
    else
    {
        // Unrolling is much better for small strides
        for (uint64_t i = 0; i < ni; i+=4) {
            sum0 += data[(i+0)*nj+0];
            sum1 += data[(i+1)*nj+0];
            sum2 += data[(i+2)*nj+0];
            sum3 += data[(i+3)*nj+0];
        }
    }
    
    return sum0 + sum1 + sum2 + sum3;
}

Here is the result of the modified code:

We can see that peaks no longer appear in the timings. We can also see that the values are much bigger due to dt0 being about 4 times smaller (due to the loop unrolling).

Note that cache trashing in the L2 cache is not avoided with this method in practice (at least on Intel processors). This means that the effect is still here with huge nj strides multiple of 512 (4 KiB) on my machine (it is actually a slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj%512) == 0 && nj >= 512 on x86 processors. The effect AFAIK, there is no way to address this problem. That being said, this is a very bad idea to perform such big strided accesses on very-large data structures.

Note that distance should be carefully chosen since early prefetching can result cache line being evicted before they are actually used (so they need to be fetched again) and late prefetching is not much useful. I think using value close to the number of entries in the LFB/MAB is a good idea (eg. 12 on Skylake/KabyLake/CannonLake, 22 on Zen-2).

Removing the peaks

More Related Contents:

Leave a Comment Cancel reply