It is inlined, but not optimized away, because you compiled with `-O0` (the default). That generates asm for consistent debugging, allowing you to modify any C++ variable while stopped at a breakpoint on any line.

This means the compiler spills everything from registers after every statement, and reloads what it needs for the next one. So more statements to express the same logic = slower code, whether they’re in the same function or not.
*Why does clang produce inefficient asm for this simple floating point sum (with `-O0`)?* explains in more detail.
Normally `-O0` won’t inline functions, but it does respect `__attribute__((always_inline))`.
*C loop optimization help for final assignment* explains why benchmarking or tuning with `-O0` is totally pointless. Both versions are ridiculous garbage for performance.
If it wasn’t inlined, there’d be a `call` instruction invoking it inside the loop.
The asm is actually creating the pointers in registers for `const WrappedDouble& left` and `right` (very inefficiently, using multiple instructions instead of one `lea`; the `addq %rdx, %rax` is the final step in one of those).
Then it spills those pointer args to stack memory, because they’re real variables and have to be in memory where a debugger could modify them. That’s what `movq %rax, -16(%rbp)` and the corresponding store of `%rdx` are doing.
After reloading and dereferencing those pointers, the `addsd` (add scalar double) result is itself spilled back to a local in stack memory with `movsd %xmm0, -8(%rbp)`. This isn’t a named variable; it’s the return value of the function.
It’s then reloaded and copied again to another stack location, then finally `arr` and `i` are loaded from the stack, along with the `double` result of `operator+`, and that’s stored into `arr[i]` with `movq %rsi, (%rax,%rdx,8)`. (Yes, LLVM used a 64-bit integer `mov` to copy a `double` that time; the earlier copies used SSE2 `movsd`.)
All of those copies of the return value are on the critical path of the loop-carried dependency chain, because the next iteration reads `arr[i-1]`. Those ~5 or 6 cycle store-forwarding latencies really add up vs. the 3 or 4 cycle latency of an FP `add`.
Obviously that’s massively inefficient. With optimization enabled, gcc and clang have no trouble inlining and optimizing away your wrapper.
They also optimize by keeping the `arr[i]` result in a register for use as the `arr[i-1]` value in the next iteration. This avoids the ~6 cycle store-forwarding latency that would otherwise be inside the loop, if the asm matched the source.
i.e. the optimized asm looks something like this C++:

```cpp
double tmp = arr[0];  // kept in XMM0
for (...) {
    tmp += arr[i];    // no re-read of memory
    arr[i] = tmp;
}
```
Amusingly, clang doesn’t bother to initialize its `tmp` (`xmm0`) before the loop, because you don’t bother to initialize the array. It’s strange that it doesn’t warn about this UB. In practice a big `malloc` with glibc’s implementation will give you fresh pages from the OS, which all hold zeros, i.e. `0.0`. But clang will give you whatever was left around in XMM0! If you add a `((double*)arr)[0] = 1;`, clang will load the first element before the loop.
Unfortunately the compiler doesn’t know how to do any better than that for your prefix-sum calculation. See *parallel prefix (cumulative) sum with SSE* and *SIMD prefix sum on Intel cpu* for ways to speed this up by another factor of maybe 2, and/or parallelize it.
I prefer Intel syntax, but the Godbolt compiler explorer can give you AT&T syntax like in your question if you like.
```asm
# gcc8.2 -O3 -march=haswell -Wall
.LC0:
        .string "done"
main:
        sub     rsp, 8
        mov     edi, 800000000
        call    malloc                        # return value in RAX
        vmovsd  xmm0, QWORD PTR [rax]         # load first element
        lea     rdx, [rax+8]                  # p = &arr[1]
        lea     rcx, [rax+800000000]          # endp = arr + len
.L2:                                          # do {
        vaddsd  xmm0, xmm0, QWORD PTR [rdx]   # tmp += *p
        add     rdx, 8                        # p++
        vmovsd  QWORD PTR [rdx-8], xmm0       # p[-1] = tmp
        cmp     rdx, rcx
        jne     .L2                           # } while(p != endp);
        mov     rdi, rax
        call    free
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
```
Clang unrolls a bit, and like I said doesn’t bother to init its `tmp`.
```asm
# just the inner loop from clang -O3
# with -march=haswell it unrolls a lot more, so I left that out.
# hence the 2-operand SSE2 addsd instead of 3-operand AVX vaddsd
.LBB0_1:                                      # do {
        addsd   xmm0, qword ptr [rax + 8*rcx - 16]
        movsd   qword ptr [rax + 8*rcx - 16], xmm0
        addsd   xmm0, qword ptr [rax + 8*rcx - 8]
        movsd   qword ptr [rax + 8*rcx - 8], xmm0
        addsd   xmm0, qword ptr [rax + 8*rcx]
        movsd   qword ptr [rax + 8*rcx], xmm0
        add     rcx, 3                        # i += 3
        cmp     rcx, 100000002
        jne     .LBB0_1                       # } while(i != 100000002)
```
Apple Xcode’s `gcc` is really clang/LLVM in disguise on modern OS X systems.