It is inlined, but not optimized away, because you compiled with `-O0` (the default). That generates asm for consistent debugging, allowing you to modify any C++ variable while stopped at a breakpoint on any line.

This means the compiler spills everything from registers after every statement, and reloads what it needs for the next one. So more statements to express the same logic = slower code, whether they’re in the same function or not.
*Why does clang produce inefficient asm for this simple floating point sum (with `-O0`)?* explains in more detail.
Normally `-O0` won’t inline functions, but it does respect `__attribute__((always_inline))`.
*C loop optimization help for final assignment* explains why benchmarking or tuning with `-O0` is totally pointless. Both versions are ridiculous garbage for performance.
If it wasn’t inlined, there’d be a `call` instruction invoking it inside the loop.
The asm is actually creating the pointers in registers for `const WrappedDouble& left` and `right` (very inefficiently, using multiple instructions instead of one `lea`; the `addq %rdx, %rax` is the final step in one of those).
Then it spills those pointer args to stack memory, because they’re real variables and have to be in memory where a debugger could modify them. That’s what `movq %rax, -16(%rbp)` and the corresponding store of `%rdx` are doing.
After reloading and dereferencing those pointers, the `addsd` (add scalar double) result is itself spilled back to a local in stack memory with `movsd %xmm0, -8(%rbp)`. This isn’t a named variable; it’s the return value of the function.
It’s then reloaded and copied again to another stack location, then finally `arr` and `i` are loaded from the stack, along with the `double` result of `operator+`, and that’s stored into `arr[i]` with `movq %rsi, (%rax,%rdx,8)`. (Yes, LLVM used a 64-bit integer `mov` to copy a `double` that time; the earlier copies used SSE2 `movsd`.)
All of those copies of the return value are on the critical path of the loop-carried dependency chain, because the next iteration reads `arr[i-1]`. Those ~5 or 6 cycle store-forwarding latencies really add up vs. the 3 or 4 cycle latency of an FP `add`.
Obviously that’s massively inefficient. With optimization enabled, gcc and clang have no trouble inlining and optimizing away your wrapper.
They also optimize by keeping the `arr[i]` result in a register for use as the `arr[i-1]` value in the next iteration. This avoids the ~6 cycle store-forwarding latency that would otherwise be inside the loop, if the asm matched the source.
i.e. the optimized asm looks something like this C++:

```cpp
double tmp = arr[0];  // kept in XMM0
for (...) {
    tmp += arr[i];    // no re-read of memory
    arr[i] = tmp;
}
```
Amusingly, clang doesn’t bother to initialize its `tmp` (`xmm0`) before the loop, because you don’t bother to initialize the array. It’s strange that it doesn’t warn about this UB. In practice a big `malloc` with glibc’s implementation will give you fresh pages from the OS, which all hold zeros, i.e. `0.0`. But clang will give you whatever was left around in XMM0! If you add a `((double*)arr)[0] = 1;`, clang will load the first element before the loop.
Unfortunately the compiler doesn’t know how to do any better than that for your prefix-sum calculation. See *parallel prefix (cumulative) sum with SSE* and *SIMD prefix sum on Intel cpu* for ways to speed this up by another factor of maybe 2, and/or parallelize it.
I prefer Intel syntax, but the Godbolt compiler explorer can give you AT&T syntax like in your question if you like.
```asm
# gcc8.2 -O3 -march=haswell -Wall
.LC0:
        .string "done"
main:
        sub     rsp, 8
        mov     edi, 800000000
        call    malloc                        # return value in RAX
        vmovsd  xmm0, QWORD PTR [rax]         # load first element
        lea     rdx, [rax+8]                  # p = &arr[1]
        lea     rcx, [rax+800000000]          # endp = arr + len
.L2:                                          # do {
        vaddsd  xmm0, xmm0, QWORD PTR [rdx]   # tmp += *p
        add     rdx, 8                        # p++
        vmovsd  QWORD PTR [rdx-8], xmm0       # p[-1] = tmp
        cmp     rdx, rcx
        jne     .L2                           # } while(p != endp);
        mov     rdi, rax
        call    free
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
```
Clang unrolls a bit, and like I said doesn’t bother to init its `tmp`.
```asm
# just the inner loop from clang -O3
# with -march=haswell it unrolls a lot more, so I left that out.
# hence the 2-operand SSE2 addsd instead of 3-operand AVX vaddsd
.LBB0_1:                                      # do {
        addsd   xmm0, qword ptr [rax + 8*rcx - 16]
        movsd   qword ptr [rax + 8*rcx - 16], xmm0
        addsd   xmm0, qword ptr [rax + 8*rcx - 8]
        movsd   qword ptr [rax + 8*rcx - 8], xmm0
        addsd   xmm0, qword ptr [rax + 8*rcx]
        movsd   qword ptr [rax + 8*rcx], xmm0
        add     rcx, 3                        # i += 3
        cmp     rcx, 100000002
        jne     .LBB0_1                       # } while(i != 100000002)
```
Apple Xcode’s `gcc` is really clang/LLVM in disguise on modern OS X systems.