Why do compilers insist on using a callee-saved register here?

TL;DR:

  • Compiler internals are probably not set up to look for this optimization easily, and it’s probably only useful around small functions, not inside large functions between calls.
  • Inlining to create large functions is a better solution most of the time.
  • There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.

Compilers are complex pieces of machinery. They’re not “smart” like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.

I reported this back in 2016 as GCC bug 69986 – smaller code possible with -Os by using push/pop to spill/reload; there’s been no activity or replies from GCC devs since. :/

Slightly related: GCC bug 70408 – reusing the same call-preserved register would give smaller code in some cases. Compiler devs told me it would be a huge amount of work for GCC to do that optimization, because it would require choosing the order of evaluation of two foo(int) calls based on what makes the target asm simpler.
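
For example (a sketch of my own, Intel syntax, made-up names): for something like int f(int a, int b) { return foo(a) + foo(b); }, evaluating foo(b) first lets a single call-preserved register do double duty, first holding a across one call and then holding foo(b)'s result across the other:

    f:
        push  rbx            # one push also keeps RSP 16-byte aligned for the calls
        mov   ebx, edi       # save a in a call-preserved register
        mov   edi, esi
        call  foo            # foo(b)
        mov   edi, ebx       # next arg = a
        mov   ebx, eax       # RBX now holds foo(b); a is no longer needed
        call  foo            # foo(a)
        add   eax, ebx       # foo(a) + foo(b)
        pop   rbx
        ret

C doesn’t fix the evaluation order of the two operands of +, so this ordering is legal; the point of the bug report is that choosing that order based on what helps register allocation isn’t something GCC’s internals are structured to do.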


If foo doesn’t save/restore rbx itself, there’s a tradeoff between throughput (instruction count) and extra store/reload latency on the x -> retval dependency chain.

Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see; and execution is actually bursty, with stalls probably accounting for some of the less-than-4 uops/clock average.)
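
As a concrete sketch of that kind of choice (not code from the question), here are two ways to multiply by 10:

    # optimizing for fewer uops / smaller code:
    imul  eax, edi, 10           # 1 uop, 3-cycle latency, 1/clock throughput

    # optimizing for latency:
    lea   eax, [rdi + rdi*4]     # eax = x*5   (1 uop, 1-cycle latency)
    lea   eax, [rax + rax]       # eax = x*10  (2 cycles total, but 2 uops)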

If foo does push/pop RBX, then there’s not much to gain for latency. Having the restore happen just before the ret instead of just after is probably not relevant, unless there’s a ret mispredict or I-cache miss that delays fetching code at the return address.

Most non-trivial functions will save/restore RBX, so it’s often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)


So yes, push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and on the balance between the extra store/reload latency for x vs. the extra instructions to save/restore the caller’s rbx.
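
For concreteness, here’s a sketch of the two versions being compared (Intel syntax; assuming bar() is something like int bar(int x) { foo(); return x; }, since the exact source isn’t shown here):

    bar:                        # roughly what compilers emit today
        push  rbx               # save the caller's RBX
        mov   ebx, edi          # keep x in a call-preserved register across the call
        call  foo
        mov   eax, ebx          # return value = x
        pop   rbx               # restore the caller's RBX
        ret

    bar_pushpop:                # the smaller push/pop version (what bug 69986 asks for)
        push  rdi               # spill x; also keeps RSP 16-byte aligned for the call
        call  foo
        pop   rax               # reload x straight into the return-value register
        ret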

It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don’t know this optimization either: using push to reserve space and initialize a variable. See “What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?”. And doing that for more than one local var would lead to larger .eh_frame stack-unwind metadata, because you’re moving the stack pointer separately with each push. That doesn’t stop compilers from using push/pop to save/restore call-preserved regs, though.)
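
For example, in GAS syntax the push/pop version could carry unwind info via the standard CFI directives, something like this (a hand-written sketch, not compiler output):

    bar_pushpop:
        .cfi_startproc
        push  rdi                    # reserve 8 bytes and store x in one instruction
        .cfi_adjust_cfa_offset 8     # unwind metadata records the RSP change
        call  foo
        pop   rax
        .cfi_adjust_cfa_offset -8
        ret
        .cfi_endproc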


IDK if it would be worth teaching compilers to look for this optimization

It’s maybe a good idea around a whole function, not across one call inside a function. And as I said, it’s based on the pessimistic assumption that foo will save/restore RBX anyway. (Or on optimizing for throughput, if you know that latency from x to the return value isn’t important. But compilers don’t know that, and they usually optimize for latency.)

If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you’ll start getting more cases where RBX isn’t saved/restored and you could have taken advantage of it.

You also don’t want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, most functions make multiple function calls in the general case. This optimization idea could apply if you really don’t use x between any of the calls, just before the first and after the last; otherwise you have the problem of maintaining 16-byte stack alignment for each call if you’re doing one pop after a call, before another call.
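
A sketch of the loop case (hypothetical function, Intel syntax): pay the push/pop cost once outside the loop instead of spilling/reloading around every call.

    # roughly:  do { foo(i); } while (--i != 0);   (assuming i starts >= 1)
    loop_calls:
        push  rbx               # one save/restore for the whole function
        mov   ebx, edi          # keep i in a call-preserved register
    .Lloop:
        mov   edi, ebx
        call  foo               # RBX survives the call
        dec   ebx
        jnz   .Lloop
        pop   rbx
        ret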

Compilers are not great at tiny functions in general, and tiny non-inline functions aren’t great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: the caller has to assume that the function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn’t escaped the function.) The compiler also has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.
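
A sketch of what that implies for code-gen (made-up names, Intel syntax): a global that was modified has to be stored before the call and reloaded after it, while a local whose address never escapes can stay in a (call-preserved) register.

    # roughly:  counter++;  int x = a;  foo();  return counter + x;
    # (counter is an int global defined elsewhere)
    caller:
        push  rbx
        mov   ebx, edi                        # local x: address never escapes, a register is fine
        add   DWORD PTR counter[rip], 1       # global must be updated in memory before the call...
        call  foo                             # ...because foo might read or write it
        mov   eax, DWORD PTR counter[rip]     # ...and reloaded after, in case foo changed it
        add   eax, ebx
        pop   rbx
        ret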

Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)


I think one reason compilers haven’t bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.

i.e. it would be a lot of work to implement and a lot of code to maintain, and if it got over-enthusiastic about doing this, it could make worse code.

Another reason is that it’s (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. That’s fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can’t inline into their callers.
