How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

The compiler is allowed to fuse a separate multiply and add into a single FMA, even though this changes the final result (usually by making it more accurate).

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not with -std=c*, e.g. -std=c++14). For Clang, contraction across statements is only enabled with -ffp-contract=fast; with just the #pragma enabled, it contracts only within a single expression like a+b*c, not across separate C++ statements.
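The single-expression versus separate-statements distinction can be sketched like this. The function names are hypothetical, and whether either function actually compiles to an FMA depends on your compiler, its -ffp-contract setting, and whether the target has FMA enabled (for example via -mfma), so treat the probe as an experiment rather than a guarantee.

```c
/* Single expression: multiply and add appear in one expression, so a
   compiler with FP_CONTRACT ON may emit one FMA here. */
double one_expression(double a, double b, double c)
{
    return a * b + c;
}

/* Separate statements: the product is its own statement. Fusing this
   requires contraction across statements (e.g. -ffp-contract=fast). */
double two_statements(double a, double b, double c)
{
    double product = a * b;
    return product + c;
}

/* Probe: for these inputs the unfused result is exactly 0, while a
   single-rounding (fused) evaluation gives -2^-54. Returns 1 if the
   result looks fused, 0 otherwise. */
int probe_was_fused(double (*f)(double, double, double))
{
    const double a = 1.0 + 0x1p-27;
    const double b = 1.0 - 0x1p-27;
    return f(a, b, -1.0) != 0.0;
}
```

Calling probe_was_fused(one_expression) and probe_was_fused(two_statements) under different compiler flags is a quick way to see what your toolchain does. Checking the generated assembly for vfmadd instructions is more reliable, since excess intermediate precision (as on x87) can also produce the nonzero result.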

This is different from strict vs. relaxed floating point (or in GCC terms, -fno-fast-math vs. -ffast-math), where the relaxed mode allows other kinds of optimizations that could increase the rounding error depending on the input values. FMA contraction is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn’t be allowed in strict FP.

Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you’re doing if you’re already using intrinsics.


So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:

FMA3 Intrinsics: (introduced with AVX2 on Intel Haswell; FMA3 has its own CPUID feature flag)

  • _mm_fmadd_pd(), _mm256_fmadd_pd()
  • _mm_fmadd_ps(), _mm256_fmadd_ps()
  • and about a gazillion other variations…
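As a rough illustration of the FMA3 intrinsics in use, the sketch below accumulates y[i] = a[i] * b[i] + y[i] over an array. It assumes GCC or Clang on x86: the target attribute and __builtin_cpu_supports are compiler extensions, and the names have_fma3 and fma3_accumulate are made up for this example.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#include <stddef.h>

/* Runtime check (GCC/Clang builtin): does this CPU support FMA3? */
int have_fma3(void)
{
    return __builtin_cpu_supports("fma") != 0;
}

/* y[i] = a[i] * b[i] + y[i], with one rounding per element.
   Call only when have_fma3() returns nonzero. The target attribute
   lets this one function use AVX/FMA without compiling the whole
   file with -mfma. */
__attribute__((target("avx,fma")))
void fma3_accumulate(const double *a, const double *b, double *y, size_t n)
{
    size_t i = 0;

    /* Main loop: four doubles per 256-bit FMA. */
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        _mm256_storeu_pd(y + i, _mm256_fmadd_pd(va, vb, vy));
    }

    /* Tail: scalar FMA on the low lane for the remaining elements. */
    for (; i < n; ++i) {
        __m128d r = _mm_fmadd_sd(_mm_set_sd(a[i]), _mm_set_sd(b[i]),
                                 _mm_set_sd(y[i]));
        y[i] = _mm_cvtsd_f64(r);
    }
}
#endif
```

Because the intrinsic maps directly to a vfmadd instruction, the fusion here does not depend on any -ffp-contract setting.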

FMA4 Intrinsics: (introduced with XOP on AMD Bulldozer; FMA4 has its own CPUID feature flag)

  • _mm_macc_pd(), _mm256_macc_pd()
  • _mm_macc_ps(), _mm256_macc_ps()
  • and about a gazillion other variations…
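Since the two families exist on different CPUs, code that has to run on both may want to pick one at runtime. The following is a minimal sketch of that idea, again assuming GCC or Clang on x86 (x86intrin.h, the target attribute, and __builtin_cpu_supports are compiler-specific). The names madd4, madd4_fma3, and madd4_fma4 are hypothetical.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  /* GCC/Clang umbrella header: FMA3 and FMA4 intrinsics */

/* out[0..3] = a[0..3] * b[0..3] + c[0..3] using the FMA3 intrinsic. */
__attribute__((target("avx,fma")))
static void madd4_fma3(const double *a, const double *b,
                       const double *c, double *out)
{
    _mm256_storeu_pd(out, _mm256_fmadd_pd(_mm256_loadu_pd(a),
                                          _mm256_loadu_pd(b),
                                          _mm256_loadu_pd(c)));
}

/* Same operation using the FMA4 intrinsic. */
__attribute__((target("avx,fma4")))
static void madd4_fma4(const double *a, const double *b,
                       const double *c, double *out)
{
    _mm256_storeu_pd(out, _mm256_macc_pd(_mm256_loadu_pd(a),
                                         _mm256_loadu_pd(b),
                                         _mm256_loadu_pd(c)));
}

/* Returns 3 or 4 for the instruction family used, or 0 if the CPU
   reports neither (in which case out is left untouched and the
   caller has to fall back to something else). */
int madd4(const double *a, const double *b, const double *c, double *out)
{
    if (__builtin_cpu_supports("fma")) {
        madd4_fma3(a, b, c, out);
        return 3;
    }
    if (__builtin_cpu_supports("fma4")) {
        madd4_fma4(a, b, c, out);
        return 4;
    }
    return 0;
}
#endif
```

Both paths compute the same single-rounded result; only the instruction encoding differs.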
