How to optimize these loops (with compiler optimization disabled)?

Question

Re-posting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as “what else is possible”, so I took him at his word and info-dumped about vectorizing and tuning for current CPU hardware. 🙂

The OP of that question eventually said he wasn’t allowed to use compiler options higher than -O0, which I guess is the case here, too.

Summary:

Why using -O0 distorts things (unfairly penalizes things that are fine in normal code for a normal compiler). Using -O0 (the gcc/clang default) so your loops don’t optimize away is not a valid excuse or a useful way to find out what will be faster with normal optimization enabled. (See also Idiomatic way of performance evaluation? for more about benchmark methods and pitfalls, like ways to enable optimization but still stop the compiler from optimizing away the work you want to measure.)
Stuff that’s wrong with the assignment.
Types of optimizations. FP latency vs. throughput, and dependency chains. Link to Agner Fog’s site. (Essential reading for optimization).
Experiments getting the compiler to optimize it (after fixing it to not optimize away). Best result with auto-vectorization (no source changes): gcc: half as fast as an optimal vectorized loop. clang: same speed as a hand-vectorized loop.
Some more comments on why bigger expressions are a perf win with -O0 only.
Source changes to get good performance without -ffast-math, making the code closer to what we want the compiler to do. Also some rules-lawyering ideas that would be useless in the real-world.
Vectorizing the loop with GCC architecture-neutral vectors, to see how close the auto-vectorizing compilers came to matching the performance of ideal asm code (since I checked the compiler output).

I think the point of the assignment is to sort of teach assembly-language performance optimizations using C with no compiler optimizations. This is silly. It’s mixing up things the compiler will do for you in real life with things that do require source-level changes.

See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

-O0 doesn’t just “not optimize”, it makes the compiler store variables to memory after every statement instead of keeping them in registers. It does this so you get the “expected” results if you set a breakpoint with gdb and modify the value (in memory) of a C variable. Or even if you jump to another line in the same function. So each C statement has to be compiled to an independent block of asm that starts and ends with all variables in memory. For a modern portable compiler like gcc which already transforms through multiple internal representations of program flow on the way from source to asm, this part of -O0 requires explicitly de-optimizing its graph of data flow back into separate C statements. These store/reloads lengthen every loop-carried dependency chain so it’s horrible for tiny loops if the loop counter is kept in memory. (e.g. 1 cycle per iteration for inc reg vs. 6c for inc [mem], creating a bottleneck on loop counter updates in tight loops).

With gcc -O0, the register keyword lets gcc keep a var in a register instead of memory, and thus can make a big difference in tight loops (Example on the Godbolt Compiler explorer). But that’s only with -O0. In real code, register is meaningless: the compiler attempts to optimally use the available registers for variables and temporaries. register is already deprecated in ISO C++11 (but not C11), and there’s a proposal to remove it from the language along with other obsolete stuff like trigraphs.

With an extra variables involved, -O0 hurts array indexing a bit more than pointer incrementing.

Array indexing usually makes code easier to read. Compilers sometimes fail to optimize stuff like array[i*width + j*width*height], so it’s a good idea to change the source to do the strength-reduction optimization of turning the multiplies into += adds.

At an asm level, array indexing vs. pointer incrementing are close to the same performance. (x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi]. except on Sandybridge and later.) It’s the compiler’s job to optimize your code by using pointer incrementing even when the source uses array indexing, when that’s faster.

For good performance, you have to be aware of what compilers can and can’t do. Some optimizations are “brittle”, and a small seemingly-innocent change to the source will stop the compiler from doing an optimization that was essential for some code to run fast. (e.g. pulling a constant computation out of a loop, or proving something about how different branch conditions are related to each other, and simplifying.)

Besides all that, it’s a crap sample because it doesn’t have anything to stop a smart compiler from optimizing away the entire thing. It doesn’t even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.

(You can fix this by printing sum at the end. gcc and clang don’t seem to realize that calloc returns zeroed memory, and optimize it away to 0.0. See my code below.)

Normally you’d put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can’t do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc’s optimizer (see below).

Also, the other version of this question had an uninitialized variable kicking around. It looks like long int help was introduced by the OP of that question, not the prof. So I will have to downgrade my “utter nonsense” to merely “silly”, because the code doesn’t even print the result at the end. That’s the most common way of getting the compiler not to optimize everything away in a microbenchmark like this.

I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn’t get mentioned in a 2nd-year CS class.

Besides multithreading with openmp, there’s vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.

Further essential reading:

Agner Fog’s guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory

Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. Usually the difference in rounding error is small, so the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it.

Instead of just unrolling, keep multiple accumulators which you only add up at the end, like you’re doing with the sum0..sum9 unroll-by-10. FP instructions have medium latency but high throughput, so you need to keep multiple FP operations in flight to keep the floating point execution units saturated.

If you need the result of the last op to be complete before the next one can start, you’re limited by latency. For FP add, that’s one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it’s 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)

In this case, there’s also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.

Compiler Options

Lets start by seeing what the compiler can do for us.

I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn’t optimize everything away. Let’s try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):

gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I’m of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win: it collects the sums every time, instead of giving each thread 1/4 of the outer loop iterations.
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s. profile-guided optimization is a good idea when you can exercise all the relevant code-paths, so the compiler can make better unrolling / inlining decisions.
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang 3.5 is too old to support -march=sandybridge. You should prefer to use a compiler version that’s new enough to know about the target architecture you’re tuning for, esp. if using -march to make code that doesn’t need to run on older architectures.)

gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn’t let gcc just turn it into a multiply. Or switch the loops.

clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn’t need to broadcast. It even uses 4 vector registers as 4 separate accumulators. It knows that calloc only returns 16-byte aligned memory (on x86-64 System V), and when tuning for Sandybridge (before Haswell) it knows that 32-byte loads have a big penalty when misaligned. And that splitting them isn’t too expensive since a 32-byte load takes 2 cycles in a load port anyway.

vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4

This is worse on later CPUs, especially when the data does happen to be aligned at run-time; see Why doesn’t gcc resolve _mm256_loadu_pd as single vmovupd? about GCC versions where -mavx256-split-unaligned-load was on by default with -mtune=generic.

It’s actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn’t support gcc’s __builtin_assume_aligned.) In that case it uses a tight loop of 4x vaddpd mem, %ymm, %ymm. It only runs about 0.65 insns per cycle (and 0.93 uops / cycle), according to perf, so the bottleneck isn’t front-end.

I checked with a debugger, and calloc is indeed returning a pointer that’s an odd multiple of 16. (glibc for large allocations tends to allocate new pages, and put bookkeeping info in the initial bytes, always misaligning to any boundary wider than 16.) So half the 32B memory accesses are crossing a cache line, causing a big slowdown. It is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. (gcc enables -mavx256-split-unaligned-load and ...-store for -march=sandybridge, and also for the default tune=generic with -mavx, which is not so good especially for Haswell or with memory that’s usually aligned by the compiler doesn’t know about it.)

Source level changes

As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:

for (j = 0; j < ARRAY_SIZE; j+=4) {  // unroll 4 times
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}

and then don’t collect the 4 accumulators into one until after the end of the outer loop.

Your (from the other question) source change of

sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];

actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there’s lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and, and find the parallelism to keep those medium latency, high throughput FP execution units fed.

With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it’s not a useful optimisation. Don’t waste your time on changes that only help with -O0, esp. not at the expense of readability.

Using 4 accumulator variables and not adding them together until the end of the outer loop defeats clang’s auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc’s non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn’t include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It’s really easy to get something wrong and read past the end of the array when doing this.

GCC, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it’s doing multiple iterations of the outer loop, again.

Using gcc’s platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:

// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help = 0;

    typedef double v4df __attribute__ ((vector_size (8*4)));
    v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};

    const size_t array_bytes = ARRAY_SIZE*sizeof(double);
    double *aligned_array = NULL;

    // this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
    if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
        exit (1);
    }
    memcpy(aligned_array, array, array_bytes);  // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...
    /*
    #if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
        array = __builtin_assume_aligned(array, 32);
    #else
        // force-align for other compilers.  This loop-invariant will be done outside the loop.
        array = (double*) ((ptrdiff_t)array & ~31);
    #endif
    */

        assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) );  // We don't have a cleanup loop to handle where the array size isn't a multiple of 16

        // incrementing pointers can be more efficient than indexing arrays
        // esp. on recent Intel where micro-fusion only works with one-register addressing modes
        // of course, the compiler can always generate pointer-incrementing asm from array-indexing source
        const double *start = aligned_array;

        while ( (ptrdiff_t)start & 31 ) {
            // annoying loops like this are the reason people use aligned buffers
            sum += *start++;        // scalar until we reach 32B alignment
            // in practice, this loop doesn't run, because we copy into an aligned buffer
            // This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
        }

        const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
        for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
            sum0 += p[0];   // p+=4 increments the pointer by 4 * 4 * 8 bytes
            sum1 += p[1];       // make sure you keep track of what you're incrementing
            sum2 += p[2];
            sum3 += p[3];
        }

        // the compiler might be smart enough to pull this out of the inner loop
        // in fact, gcc turns this into a 64bit movabs outside of both loops :P
        help+= ARRAY_SIZE;

            // ... and this one. But your inner loop must do the same
            // number of additions as this one does.

        /* You could argue legalese and say that
         if (i == 0) {
             for (j ...)
                 sum += array[j];
             sum *= N_TIMES;
         }
         * still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
         */
    }

    // You can add some final code between this comment ...
    sum0 = (sum0 + sum1) + (sum2 + sum3);
    sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
    printf("sum = %g; help=%ld\n", sum, help);  // defeat the compiler.

    free (aligned_array);
    free (array);  // not strictly necessary, because this is the end of main().  Leaving it out for this special case is a bad example for a CS class, though.
    // ... and this one.

    return 0;
}

The inner loop compiles to:

  4007c0:       c5 e5 58 19             vaddpd (%rcx),%ymm3,%ymm3
  4007c4:       48 83 e9 80             sub    $0xffffffffffffff80,%rcx   # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
  4007c8:       c5 f5 58 49 a0          vaddpd -0x60(%rcx),%ymm1,%ymm1   # one-register addressing mode can micro-fuse
  4007cd:       c5 ed 58 51 c0          vaddpd -0x40(%rcx),%ymm2,%ymm2
  4007d2:       c5 fd 58 41 e0          vaddpd -0x20(%rcx),%ymm0,%ymm0
  4007d7:       4c 39 c1                cmp    %r8,%rcx  # compare with end with p
  4007da:       75 e4                   jne    4007c0 <main+0xb0>

(For more, see online compiler output at the godbolt compiler explorer. The -xc compiler option compiles as C, not C++. The inner loop is from .L3 to jne .L3. See the x86 tag wiki for x86 asm links. See also this q&a about micro-fusion not happening on SnB-family, which Agner Fog’s guides don’t cover).

performance:

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec 
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000

 Performance counter stats for './fl3-vec':

       1086.571078      task-clock (msec)         #    1.000 CPUs utilized
     4,072,679,849      cycles                    #    3.748 GHz
     2,629,419,883      instructions              #    0.65  insns per cycle
                                                  #    1.27  stalled cycles per insn
     4,028,715,968      r1b1                      # 3707.733 M/sec  # unfused uops
     2,257,875,023      r10e                      # 2077.982 M/sec  # fused uops.  lower than insns because of macro-fusion
     3,328,275,626      stalled-cycles-frontend   #   81.72% frontend cycles idle
     1,648,011,059      stalled-cycles-backend    #   40.47% backend  cycles idle
       751,736,741      L1-dcache-load-misses     #  691.843 M/sec
            18,772      cache-misses              #    0.017 M/sec

       1.086925466 seconds time elapsed

I still don’t know why it’s getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn’t be the problem. Sandybridge L2 cache can sustain one 32B transfers per cycle, which should keep up with the one 32B FP vector add per cycle.

32B loads from L1 take 2 cycles (it wasn’t until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we’re not reaching).

Perhaps the loads need to be pipelined ahead of when they’re used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.

0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/

update: I guess L2 bandwidth was the problem after all. There aren’t enough line-fill buffers to keep enough misses in flight to sustain the peak throughput every cycle. L2 sustained bandwidth is less than peak on Intel SnB / Haswell / Skylake CPUs.

See also Single Threaded Memory Bandwidth on Sandy Bridge (Intel forum thread, with much discussion about what limits throughput, and how latency * max_concurrency is one possible bottleneck. See also the “Latency Bound Platforms” part of the answer to Enhanced REP MOVSB for memcpy limited memory concurrency is a bottleneck for loads as well as stores, but for loads prefetch into L2 does mean you might not be limited purely by Line Fill buffers for outstanding L1D misses.

Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That’s 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) Loop tiling is a much better solution, see below.

Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the 64kiB L1D on an AMD K10 (Istanbul) CPU, but not Bulldozer-family (16kiB L1D) or Ryzen (32kiB L1D).

Gcc’s attempt to vectorize by broadcasting the same value into a parallel add doesn’t seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.

Also, it’s pretty silly. The N_TIMES is a just a make-work repeat. We don’t actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we’re allowed to modify:

for (...) {
    sum += a[j] + a[j] + a[j] + a[j];
}
i += 3;  // The inner loop does 4 total iterations of the outer loop

More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I’ve read that Intel’s compiler will sometimes do that for you.

A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.

You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you’re allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.

Compiler Options

Source level changes

More Related Contents:

Leave a Comment Cancel reply