Why doesn’t gcc resolve _mm256_loadu_pd as single vmovupd?

GCC’s default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime.

Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don’t want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own computer. There’s no “generic-avx2” tuning. (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html).
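As a rough illustration of what these options change, here is a minimal sketch. The function name add_self4 is made up for this example, and the scalar fallback is there only so the snippet still builds when AVX is not enabled. The asm in the comment is what gcc of roughly this era typically emits, so check your own compiler's output rather than taking it as given.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper: dst[0..3] = src[0..3] + src[0..3].
 * Neither pointer has to be 32-byte aligned.
 *
 * gcc -O3 -mavx (default -mtune=generic) typically splits the load:
 *     vmovupd     xmm0, [rsi]
 *     vinsertf128 ymm0, ymm0, [rsi+16], 1
 * and splits the store in a similar way (128-bit store + vextractf128).
 *
 * gcc -O3 -mavx -mno-avx256-split-unaligned-load \
 *     -mno-avx256-split-unaligned-store   (or -mtune=haswell)
 * typically emits a single vmovupd ymm load and a single ymm store.
 */
void add_self4(double *dst, const double *src)
{
#ifdef __AVX__
    __m256d v = _mm256_loadu_pd(src);          /* one unaligned 256-bit load in the source */
    _mm256_storeu_pd(dst, _mm256_add_pd(v, v));
#else
    /* Fallback when built without -mavx: same result, no intrinsics. */
    for (size_t i = 0; i < 4; i++)
        dst[i] = src[i] + src[i];
#endif
}
```

Compiling this with -O3 -mavx -S under different -mtune settings is a quick way to see which strategy your gcc picks.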

Intel Sandybridge runs 256-bit loads as a single uop that takes 2 cycles in a load port. (Unlike AMD which decodes all 256-bit vector instructions as 2 separate uops.) Sandybridge has a problem with unaligned 256-bit loads (if the address is actually misaligned at runtime). I don’t know the details, and haven’t found much specific info on exactly what the slowdown is. Perhaps because it uses a banked cache, with 16-byte banks? But IvyBridge handles 256-bit loads better and still has banked cache.

According to the GCC mailing list message about the code that implements the option (https://gcc.gnu.org/ml/gcc-patches/2011-03/msg01847.html), “It speeds up some SPEC CPU 2006 benchmarks by up to 6%.” (I think that’s for Sandybridge, the only Intel AVX CPU that existed at the time.)


But if memory is actually 32-byte aligned at runtime, this is pure downside even on Sandybridge and most AMD CPUs (see footnote 1). So with this tuning option, you potentially lose just from failing to tell your compiler about alignment guarantees. And if your loop runs on aligned memory most of the time, you’d better compile at least that compilation unit with -mno-avx256-split-unaligned-load or tuning options that imply that.
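One way to tell the compiler about an alignment guarantee in an auto-vectorized loop is the GCC/Clang builtin __builtin_assume_aligned. The sketch below assumes the caller really does pass 32-byte-aligned pointers; the function name add_aligned32 is hypothetical. With intrinsics, the aligned counterparts _mm256_load_pd / _mm256_store_pd serve the same purpose.

```c
#include <stddef.h>

/* Hypothetical example: a[i] += b[i], where the caller promises that both
 * arrays start on a 32-byte boundary.
 * __builtin_assume_aligned is a GCC/Clang builtin. Passing a pointer that
 * is not actually 32-byte aligned is undefined behaviour. */
void add_aligned32(float *restrict a, const float *restrict b, size_t n)
{
    float *pa = __builtin_assume_aligned(a, 32);
    const float *pb = __builtin_assume_aligned(b, 32);

    for (size_t i = 0; i < n; i++)
        pa[i] += pb[i];
}
```

With the promise in place the compiler is free to use aligned 256-bit accesses, so the split-unaligned tuning has nothing to split. Whether a given gcc version actually takes advantage of it is worth checking in the generated asm.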

Splitting in software imposes the cost all the time. Letting hardware handle it makes the aligned case perfectly efficient (except stores on Piledriver, see footnote 1), with the misaligned case possibly slower than with software splitting on some CPUs. So software splitting is the pessimistic approach, and makes sense only if it’s really likely that the data is misaligned at runtime, rather than just not guaranteed to always be aligned at compile time. e.g. maybe you have a function that’s called most of the time with aligned buffers, but you still want it to work for rare / small cases where it’s called with misaligned buffers. In that case, a split-load/store strategy is inappropriate even on Sandybridge.

It’s common for buffers to be 16-byte aligned but not 32-byte aligned, because malloc on x86-64 glibc (and C++ new in libstdc++) returns 16-byte aligned buffers (because alignof(max_align_t) == 16). For large buffers, the pointer is normally 16 bytes after the start of a page, so it’s always misaligned for alignments larger than 16. Use aligned_alloc instead.
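A minimal sketch of that, assuming a C11 library that provides aligned_alloc. The helper names alloc_doubles_32 and is_aligned_32 are made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for n doubles on a 32-byte boundary.
 * C11 aligned_alloc wants the size to be a multiple of the alignment, so
 * the byte count is rounded up. No overflow check on n * sizeof(double).
 * Release the buffer with free(). */
double *alloc_doubles_32(size_t n)
{
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;   /* round up to a multiple of 32 */
    return aligned_alloc(32, bytes);
}

/* Nonzero if p sits on a 32-byte boundary. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p & 31) == 0;
}
```

The second helper is handy for checking at runtime whether buffers coming from elsewhere (plain malloc, a caller you don't control) are only 16-byte aligned.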


Note that -mavx and -mavx2 don’t change tuning options at all: gcc -O3 -mavx2 still tunes for all CPUs, including ones that can’t actually run AVX2 instructions. This is pretty dumb, because you should use a single unaligned 256-bit load if tuning for “the average AVX2 CPU”. Unfortunately gcc has no option to do that, and -mavx2 doesn’t imply -mno-avx256-split-unaligned-load or anything. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 for feature requests to have instruction-set selection influence tuning.

This is why you should use -march=native to make binaries for local use, or maybe -march=sandybridge -mtune=haswell to make binaries that can run on a wide range of machines, but will probably mostly run on newer hardware that has AVX. (Note that even Skylake Pentium/Celeron CPUs don’t have AVX or BMI2; probably on CPUs with any defects in the upper half of 256-bit execution units or register files, they disable decoding of VEX prefixes and sell them as low-end Pentium.)


gcc8.2’s tuning options are as follows. (-march=x implies -mtune=x). https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html.

I checked on the Godbolt compiler explorer by compiling with -O3 -fverbose-asm and looking at the comments which include a full dump of all implied options. I included _mm256_loadu/storeu_ps functions, and a simple float loop that can auto-vectorize, so we can also look at what the compiler does.

Use -mprefer-vector-width=256 (gcc8) or -mno-prefer-avx128 (gcc7 and earlier) to override tuning options like -mtune=bdver3 and get 256-bit auto-vectorization if you want, instead of only with manual vectorization.

  • default / -mtune=generic: both -mavx256-split-unaligned-load and -store. Arguably less and less appropriate as Intel Haswell and later become more common, and the downside on recent AMD CPUs is I think still small. Especially splitting unaligned loads, which AMD tuning options don’t enable.
  • -march=sandybridge and -march=ivybridge: split both. (I think I’ve read that IvyBridge improved handling of unaligned 256-bit loads or stores, so it’s less appropriate for cases where the data might be aligned at runtime.)
  • -march=haswell and later: neither splitting option enabled.
  • -march=knl: neither splitting option enabled. (Silvermont/Atom don’t have AVX)
  • -mtune=intel: neither splitting option enabled. Even with gcc8, auto-vectorization with -mtune=intel -mavx chooses to reach an alignment boundary for the read/write destination array, unlike gcc8’s normal strategy of just using unaligned. (Again, another case of software handling that always has a cost vs. letting the hardware deal with the exceptional case.)

  • -march=bdver1 (Bulldozer): -mavx256-split-unaligned-store, but not loads.
    It also sets -mprefer-vector-width=128, the gcc8 equivalent of -mprefer-avx128 in gcc7 and earlier (auto-vectorization will only use 128-bit AVX, but of course intrinsics can still use 256-bit vectors).
  • -march=bdver2 (Piledriver), bdver3 (Steamroller), bdver4 (Excavator): same as Bulldozer. They auto-vectorize an FP a[i] += b[i] loop with software prefetch and enough unrolling to only prefetch once per cache line!
  • -march=znver1 (Zen): -mavx256-split-unaligned-store but not loads, still auto-vectorizing with only 128-bit, but this time without SW prefetch.
  • -march=btver2 (AMD Fam16h, aka Jaguar): neither splitting option enabled, auto-vectorizing like Bulldozer-family with only 128-bit vectors + SW prefetch.
  • -march=eden-x4 (Via Eden with AVX2): neither splitting option enabled, but the -march option doesn’t even enable -mavx, and auto-vectorization uses movlps / movhps 8-byte loads, which is really dumb. At least use movsd instead of movlps to break the false dependency. But if you enable -mavx, it uses 128-bit unaligned loads. Really weird / inconsistent behaviour here, unless there’s some strange front-end for this.

options (enabled as part of -march=sandybridge for example, presumably also for Bulldozer-family; -march=bdver2 is Piledriver). That doesn’t solve the problem when the compiler knows the memory is aligned, though.


Footnote 1: AMD Piledriver has a performance bug that makes 256-bit store throughput terrible: even aligned vmovaps [mem], ymm stores run at only one per 17 to 20 clocks, according to Agner Fog’s microarch pdf (https://agner.org/optimize/). This effect isn’t present in Bulldozer or Steamroller/Excavator.

Agner Fog says 256-bit AVX throughput in general (not loads/stores specifically) on Bulldozer/Piledriver is typically worse than 128-bit AVX, partly because it can’t decode instructions in a 2-2 uop pattern. Steamroller makes 256-bit close to break-even (if it doesn’t cost extra shuffles). But register-register vmovaps ymm instructions still only benefit from mov-elimination for the low 128 bits on Bulldozer-family.

But closed-source software or binary distributions typically don’t have the luxury of building with -march=native on every target architecture, so there’s a tradeoff when making a binary that can run on any AVX-supporting CPU. Gaining a big speedup with 256-bit code on some CPUs is typically worth it as long as there aren’t catastrophic downsides on other CPUs.

Splitting unaligned loads/stores is an attempt to avoid big problems on some CPUs. It costs extra uop throughput, and extra ALU uops, on recent CPUs. But at least vinsertf128 ymm, [mem], 1 doesn’t need the shuffle unit on port 5 on Haswell/Skylake: it can run on any vector ALU port. (And it doesn’t micro-fuse, so it costs 2 uops of front-end bandwidth.)


PS:

Most code isn’t compiled by bleeding-edge compilers, so if the “generic” tuning changed now, it would take a while before code compiled with the updated tuning got into use. (Of course, most code is compiled with just -O2 or -O3, and this option only affects AVX code-gen anyway. But many people unfortunately use -O3 -mavx2 instead of -O3 -march=native, so they can miss out on FMA, BMI1/2, popcnt, and other things their CPU supports.)
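To see what a given set of flags actually enabled, one option is to look at the compiler's predefined macros. The sketch below relies on the GCC/Clang macros __AVX2__, __FMA__, __BMI2__ and __POPCNT__; the function name report_isa is made up for this example.

```c
#include <stdio.h>

/* Prints which of a few ISA extensions this translation unit was compiled
 * with, based on GCC/Clang predefined macros, and returns how many of the
 * four were enabled. This reflects compile-time flags, not what the CPU
 * running the binary supports. */
int report_isa(void)
{
    int n = 0;
#ifdef __AVX2__
    puts("AVX2");
    n++;
#endif
#ifdef __FMA__
    puts("FMA");
    n++;
#endif
#ifdef __BMI2__
    puts("BMI2");
    n++;
#endif
#ifdef __POPCNT__
    puts("POPCNT");
    n++;
#endif
    return n;
}
```

Building the same file once with -O3 -mavx2 and once with -O3 -march=native, then comparing the output, shows which extensions the narrower flag leaves out on your machine.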
