Simple for() loop benchmark takes the same time with any loop bound

Question

BTW, if you’d actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. Large literal numbers are implicitly promoted to a type large enough to hold them, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value of int i.

Signed overflow is Undefined Behaviour in C++, and so is an infinite loop with no side-effects inside the loop (e.g. a system call), so I checked on the Godbolt compiler explorer to see how it really compiles with optimization enabled. Fun fact: MSVC still optimizes away an empty loop (including one with i*10 not assigned to anything) when the condition is an always-true compare, rather than a non-zero constant like 42.

A loop like this is pretty fundamentally flawed.

You can microbenchmark a complete non-inline function using Google’s benchmark package (How would you benchmark the performance of a function?), but to learn anything useful by putting something inside a repeat-loop you have to know a lot about how your compiler compiles to asm, exactly what you’re trying to measure, and how to get the optimizer to make asm that’s similar to what you’d get from your code in some real use context. e.g. by using inline asm to require it to have a certain result in a register, or by assigning to a volatile variable (which also has the overhead of doing a store).

If this sound a lot more complicated than you were hoping, well too bad, it is, and for good reason.

That’s because compilers are incredibly complex pieces of machinery that can usually produce fairly efficient executables out of source that’s written to clearly express what’s going on for human readers, not to avoid redundant work or look anything like an efficient machine-language implementation (which is what your CPU actually runs).

Microbenchmarking is hard – you can’t get meaningful results unless you understand how your code compiles and what you’re really measuring.

Optimizing compilers are designed to create an executable that produces the same results as your C++ source, but which runs as quickly as possible. Performance is not an observable result, so it’s always legal to make a program more efficient. This is the “as-if” rule: What exactly is the “as-if” rule?

You want your compiler to not waste time and code-size computing results that aren’t used. After the compiler inlines a function into the caller, it often turns out that some of the things it computes aren’t used. It’s normal for well-written C++ code to have lots of work that can be thrown away, including optimizing away temporary variables entirely; this is not a bad thing and a compiler that didn’t do this would suck.

Remember that you’re writing for the C++ abstract machine, but your compiler’s job is to translate that into assembly language (or machine code) for your CPU. Assembly language is quite different from C++. (And modern higher-performance CPUs can also execute instructions out of order, while following their own “as-if” rule to preserve the illusion of the compiler-generated code running in program order. But CPUs can’t discard work, only overlap it.)

You can’t microbenchmark the int * int binary operator in C++ in the general case, even just for your own desktop (nevermind on other hardware / different compilers). Different uses in different contexts will compile to different code. Even if you could create some version of your loop which measured something useful, it wouldn’t necessarily tell you much about how expensive foo = a * b is in another program. The other program might bottleneck on multiply latency instead of throughput, or combine that multiply with some other nearby operation on a or b, or any number of things.

Don’t disable optimization

You might think it would be useful to just disable optimization (like gcc -O0 instead of gcc -O3). But the only way to do that also introduces anti-optimizations like storing every value back to memory after every C++ statement, and reloading variables from memory for the next statement. This lets you modify variable values while debugging a compiled program, or even jump to a new line in the same function, and still get the results you’d expect from looking at the C++ source.

Supporting that level of interference imposes a huge burden on the compiler. Store/reload (store-forwarding) has about 5 cycle latency on a typical modern x86. This means an anti-optimized loop can only run one iteration per ~6 clock cycles at best, vs. 1 cycle for a tight loop like looptop: dec eax / jnz looptop where the loop counter is in a register.

There’s not much middle ground: compilers don’t have options to make asm that “looks like” the C++ source, but keeping values in registers across statements. That wouldn’t be useful or meaningful anyway, because that’s not how they compile with full optimization enabled. (gcc -Og might be a little bit like this.)

Spending time modifying your C++ to make it run faster with -O0 is a total waste of time: C loop optimization help for final assignment (with compiler optimization disabled).

Note: anti-optimized debug mode (-O0) is the default for most compilers. It’s also “compile-fast” mode, so it’s good for seeing if your code compiles / runs at all, but it’s useless for benchmarking. The resulting compiler-generated asm runs the speed it does for reasons which depend on the hardware, but don’t tell you much of anything about how fast optimized code will run. (e.g. the answer to Adding a redundant assignment speeds up code when compiled without optimization is some fairly subtle Intel Sandybridge-family store-forwarding latency behaviour, directly caused by the store/reloads and having the loop bottleneck on the loop counter being in memory.
Note that the first version of the question was asking about why doing that made the C faster, which was rightly downvoted because benchmarking -O0 is silly. It only turned into an interesting question when I edited it into an x86 asm question which is interesting because of the larger-but-faster asm, not because it came from gcc -O0 with any particular source changes.)

And this is not even mentioning C++ standard library functions like std::sort or std::vector::push_back, which depend on the optimizer to inline lots of nested calls to tiny helper / wrapper functions.

How compilers optimize

(I’m going to show transformations of C++ code. Remember that the compiler is actually transforming an internal representation of your program’s logic and then producing asm. You can think of the transformed C++ as being pseudo-code for asm, where i++ represents an x86 inc eax instruction or something. Most C/C++ statements can map to 1 or a couple machine instructions. So it’s a useful way of describing the logic of what the actual compiler-generated asm might be doing without actually writing it in asm.)

A result which is never used doesn’t have to be computed in the first place. A loop with no side effects can be removed completely. Or a loop that assigns to a global variable (an observable side effect) could be optimized to just doing the last assignment. e.g.

// int gsink;  is a global.  
// "sink" is the opposite of a source: something we dump results into.
for (int i=0 ; i<n ; i++) {
    gsink = i*10;
}

is equivalent to this code, as far as an optimizing compiler is concerned:

if ( 0 < n ) {      // you might not have noticed that your loop could run 0 times
    gsink = (n-1)*10; // but the compiler isn't allowed to do gsink=0 if n<1
}

If gsink were instead a local or file-scoped static with nothing that reads it, the compiler could optimize it away entirely. But the compiler can’t “see” code outside the current C++ source file (“compilation unit”) while compiling a function containing that, so it can’t change the observable side-effect that when the function returns, gsink = n*10;.

Nothing reads the intermediate values of gsink because there are no function calls to non-inline function. (Because it’s not an atomic<int>, the compiler can assume that no other threads or signal handlers read it; that would be data-race Undefined Behaviour.)

Using `volatile` to get the compiler to do some work.

If it were global volatile int gsink, the actual store that places the value in memory would be an observable side-effect (that’s what volatile means in C++). But does that mean we can benchmark multiplication this way? No, it doesn’t. The side-effect the compiler has to preserve is only the placing of the final value in memory. If it can calculate it more cheaply than with i * 10 every time through the loop, it will do so.

This loop would also produce the same result sequence of assignments to gsink, and thus is a valid option for an optimizing compiler. Transforming independent multiplies to a loop-carried add is called a “strength-reduction” optimization.

volatile int gsink;

int i10 = 0;   // I could have packed this all into a for() loop
int i=0;       // but this is more readable
while (i<n) {
    gsink = i10;
    i10 += 10;
    i++;
}

Could the compiler drop i altogether and use i10 < n*10 as the loop condition? (Of course hoisting the upperbound = n*10 calculation out of the loop.)

That wouldn’t always give the same behaviour, because n*10 can overflow, so the loop can run at most INT_MAX/10 times if implemented that way. But signed overflow in C++ is undefined behaviour, and the i*10 in the loop body would overflow in any program where n*10 overflowed, so the compiler can safely introduce an n*10 without changing the behaviour of any legal / well-defined program. See What Every C Programmer Should Know About Undefined Behavior.

(Actually, i*10 is only evaluated for i up to n-1 at most, and n*10 could overflow while (n-1)*10 doesn’t. What gcc actually does is more like while(i10 != n*10) using unsigned math, when compiling for x86. x86 is a 2’s complement machine, so unsigned and signed are the same binary operation, and checking for != instead of signed less-than is safe even if (unsigned)n*10UL == 0x8000000UL, which is INT_MIN.)

For more about looking at compiler asm output, and a total-beginner intro to x86 asm, see Matt Godbolt’s CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler’s Lid”. (And related: How to remove “noise” from GCC/clang assembly output?). See http://agner.org/optimize/ for more about how current x86 CPUs perform.

Compiler output for this function from gcc7.3 -O3, compiling for x86-64, on the Godbolt compiler explorer:

volatile int gvsink;

void store_n(int n) {
    for(int i=0 ; i<n; i++) {
        gvsink = i*10;
    }
}

store_n(int):          # n in EDI  (x86-64 System V calling convention)
    test    edi, edi
    jle     .L5                   # if(n<=0) goto end

    lea     edx, [rdi+rdi*4]      # edx = n * 5
    xor     eax, eax              # tmp = 0
    add     edx, edx              # edx = n * 10

.L7:                                   # do {
    mov     DWORD PTR gvsink[rip], eax   # gvsink = tmp
    add     eax, 10                      # tmp += 10
    cmp     eax, edx
    jne     .L7                        # } while(tmp != n*10)
.L5:
    rep ret

The optimal / idiomatic asm loop structure is a do{}while(), so compilers always try to make loops like this. (That doesn’t mean you have to write your source that way, but you can let the compiler avoid the check for zero iterations in cases like this where it can’t prove that.)

If we’d used unsigned int, overflow would be well-defined as wraparound, so there’s no UB the compiler can use as an excuse to compile your code in a way you didn’t expect. (Modern C++ is not a forgiving language. Optimizing compilers are quite hostile to programmers who don’t carefully avoid any UB, and this can get pretty subtle because a lot of stuff is undefined behaviour. Compiling C++ for x86 is not like writing x86 assembly. But fortunately there are compiler options like gcc -fsanitize=undefined which will detect and warn about UB at runtime. You still have to check every possible input value you care about, though.)

void store_n(unsigned int n) {
    for(unsigned int i=0 ; i<n; i++) {
        gvsink = i*10;
    }
}

store_n(unsigned int):
    test    edi, edi
    je      .L9            # if (n==0) return;
    xor     edx, edx       # i10 = 0
    xor     eax, eax       # i = 0
.L11:                      # do{
    add     eax, 1         #    i++
    mov     DWORD PTR gvsink[rip], edx
    add     edx, 10        #    i10 += 10
    cmp     edi, eax       
    jne     .L11           # } while(i!=n)
.L9:
    rep ret

Clang compiles with two separate counters for signed and unsigned. Clang’s code is more like

i10 = 0;
do {
    gvsink = i10;
    i10 += 10;
} while(--n != 0);

So it counts down the n register toward zero, avoiding a separate cmp instruction because add/sub instructions also set the condition-code flags which the CPU can branch on. (Clang unrolls small loops by default, generating i10, i10 + 10, i10 + 20, up to i10 + 70 in registers it can store from, while only running the loop-overhead instructions once. There’s not a lot to be gained here from unrolling on typical modern CPUs, though. One store per clock cycle is a bottleneck, and so is 4 uops per clock (on Intel CPUs) issuing from the front-end into the out-of-order part of the core.

Getting the compiler to multiply:

How do we stop that strength-reduction optimization? Replacing *10 with * variable doesn’t work, then we just get code that adds are register instead of adding an immediate constant.

We could turn it into a loop over arrays like a[i] = b[i] * 10;, but then we’d be dependent on memory bandwidth as well. Also, that could auto-vectorize with SIMD instructions, which we might or might not want to be testing.

We could do something like tmp *= i; (with unsigned, to avoid signed-overflow UB). But that makes the output of the multiply in each iteration an input for the next. So we’d be benchmarking multiply latency, not throughput. (CPUs are pipelined, and for example can start a new multiply every clock cycle, but the result of a single multiply isn’t ready until 3 clock cycles later. So you’d need at least tmp1*=i, tmp2*=i, and tmp3*=i to keep the integer multiply unit on an Intel Sandybridge-family CPU saturated with work.

This comes back to having to know exactly what you’re measuring to make a meaningful microbenchmark at this level of detail.

If this answer is going way over your head, stick to timing whole functions! It really isn’t possible to say much about a single C arithmetic operator or an expression without understanding the surrounding context, and how it works in asm.

If you understand caching, you can understand memory access and arrays vs. linked lists pretty decently without getting into too much asm-level detail. It’s possible to understand C++ performance in some level of detail without knowing much about asm (beyond the fact that it exists, and that compilers optimize heavily). But the more you know about asm, CPU performance tuning, and how compilers work, the more things start to make sense.

PS:

Any computation based on compile-time-constant values can (and hopefully is) done at compile time. This is called “constant propagation”. Hiding constants from the optimizer (e.g. by inputting them as command line args (atoi(argv[1]), or with other tricks) can make the compiler-generated code for a microbenchmark look more like the real use-case, if that use-case also can’t see constants at compile time. (But note that constants defined in other files become visible with link-time optimization, which is very good for projects with lots of small functions that call each other across source file boundaries and aren’t defined in headers (.h) where they can inline normally.)

Multiplying by 16 (or any other power of 2) will use a shift, because that’s more efficient than an actual multiply instruction. This is a big deal for division, especially. See Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?, and Why does GCC use multiplication by a strange number in implementing integer division?.

Other multiply constants with only a couple bits set in their binary representation can be done with some shift+add, often with lower latency than a general-purpose multiply instruction. See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.

None of these optimizations are possible with a * b or a / b if neither input is known at compile time.

See also: How can I benchmark the performance of C++ code?, especially the links to Chandler Carruth’s CppCon 2015 talk: “Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!”.

And because it’s worth mentioning twice: Matt Godbolt’s CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler’s Lid”. It’s a gentle enough introduction that a beginner can probably follow it well enough to look at how their loop compiled, and see if it optimized away or not.

Microbenchmarking is hard – you can’t get meaningful results unless you understand how your code compiles and what you’re really measuring.

Don’t disable optimization

How compilers optimize

Using volatile to get the compiler to do some work.

Getting the compiler to multiply:

More Related Contents:

Leave a Comment Cancel reply

Using `volatile` to get the compiler to do some work.