How can I mitigate the impact of the Intel jcc erratum on gcc?

By compiler:


The GNU toolchain implements the mitigation in the assembler, with as -mbranches-within-32B-boundaries, which enables (see the GAS manual: x86 options):

  • -malign-branch-boundary=32 (care about 32-byte boundaries). Except the manual says this option takes an exponent, not a direct power of 2, so probably it’s actually ...boundary=5.
  • -malign-branch=jcc+fused+jmp (the default which does not include any of +call+ret+indirect)
  • -malign-branch-prefix-size=5 (up to 5 segment prefixes per insn).

So the relevant GCC invocation is gcc -Wa,-mbranches-within-32B-boundaries.
Unfortunately, GCC's -mtune=skylake doesn't enable this.
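
For example (a minimal sketch; foo.c / foo.s are placeholder file names):

    # Through GCC (it just forwards the option to the assembler):
    gcc -O2 -Wa,-mbranches-within-32B-boundaries -c foo.c

    # Or when running the assembler directly, either the umbrella option or its
    # pieces spelled out (boundary given in bytes here; see the caveat above):
    as -mbranches-within-32B-boundaries foo.s -o foo.o
    as -malign-branch-boundary=32 -malign-branch=jcc+fused+jmp -malign-branch-prefix-size=5 foo.s -o foo.o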

GAS’s strategy seems to be to pad as early as possible after the last alignment directive (e.g. .p2align) or after the last jcc/jmp that can end before a 32B boundary. I guess that might end up with padding in outer loops, before or after inner loops, maybe helping them fit in fewer uop cache lines? (Skylake also has its LSD loop buffer disabled, so a tiny loop split across two uop cache lines can run at best 2 cycles per iteration, instead of 1.)

It can lead to quite a large amount of padding with long macro-fused jumps, such as with -fstack-protector-strong which in recent GCC uses sub rdx,QWORD PTR fs:0x28 / jnz (earlier GCC used to use xor, which can’t fuse even on Intel). That’s 11 bytes total of sub + jnz, so could require 11 bytes of CS prefixes in the worst case to shift it to the start of a new 32B block. Example showing 8 CS prefixes in the insns before it: https://godbolt.org/z/n1dYGMdro
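
Roughly what that canary check looks like, as a sketch in GAS Intel syntax (the label names are made up and the exact register varies; only the sub/jnz pair matters here):

        .intel_syntax noprefix
    check_canary:
        sub     rdx, QWORD PTR fs:0x28      # 9 bytes: re-check the stack canary against the TLS copy
        jnz     .L_fail                     # 2 bytes: macro-fuses with the sub, 11 bytes total
        ret
    .L_fail:
        call    __stack_chk_fail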


GCC doesn't know instruction sizes; it only prints text. That's why it needs GAS to support stuff like .p2align 4,,10 to align by 16 if that will take at most 10 bytes of padding, to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.)
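
For example, a loop top in GCC's asm output tends to look something like this (the loop body here is made up):

        .p2align 4,,10          # align to 16, but only if that skips at most 10 bytes
        .p2align 3              # otherwise still align to 8 unconditionally
    .L4:                        # loop top
        add     (%rdi,%rax,4), %edx
        inc     %rax
        cmp     %rcx, %rax
        jne     .L4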

as has other fun options that aren’t on by default, like -Os to optimize hand-written asm like mov $1, %rax => mov $1, %eax / xor %rax,%rax => %eax / test $1, %eax => al and even EVEX => VEX for stuff like vmovdqa64 => vmovdqa.
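
A sketch of hand-written source that -Os can shrink; the rewrites in the comments are the ones listed above, not verified output from any particular binutils version:

    # as -Os foo.s -o foo.o
        mov     $1, %rax            # -Os can encode this as  mov $1, %eax  (upper bits are zeroed anyway)
        xor     %rax, %rax          # -Os can encode this as  xor %eax, %eax
        test    $1, %eax            # -Os can encode this as  test $1, %al
        vmovdqa64 %xmm1, %xmm0      # -Os can use the shorter VEX encoding, i.e. vmovdqa %xmm1, %xmm0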

Also stuff like -msse2avx to always use VEX prefixes even when the mnemonic isn’t v..., and -momit-lock-prefix=yes which could be used to build std::atomic code for a uniprocessor system.
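
A sketch of both (counter is a made-up symbol name):

    # as -msse2avx foo.s              rewrites legacy-SSE mnemonics to VEX encodings:
        paddd   %xmm1, %xmm0          # encoded as  vpaddd %xmm1, %xmm0, %xmm0
    # as -momit-lock-prefix=yes foo.s drops lock prefixes, e.g. for a uniprocessor-only build:
        lock addl $1, counter(%rip)   # assembled without the lock prefix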

And -mfence-as-lock-add=yes to assemble mfence into lock addl $0x0, (%rsp). But insanely it also does that for sfence and even lfence, so it's unusable in code that uses lfence as an execution barrier, which is the primary use-case for lfence, e.g. for retpolines or timing like lfence;rdtsc.
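
For example, a typical timing sequence that this option would silently break:

    # lfence as an execution barrier before reading the TSC; with -mfence-as-lock-add=yes
    # it becomes  lock addl $0x0, (%rsp), which no longer stops rdtsc from executing early.
        lfence
        rdtsc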

as also has CPU feature-level checking with -march=znver3 for example, or .arch directives. And -mtune=CPU, although IDK what that does; perhaps it sets the NOP strategy?
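
For example, a feature-level check (core2 and skylake are just illustrative choices):

    # Assembling this with  as -march=core2 foo.s  should be rejected (Core 2 has no AVX),
    # while plain  as foo.s  or  as -march=skylake foo.s  accepts it.
        vaddps  %ymm1, %ymm0, %ymm0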
