Preventing GCC from automatically using AVX and FMA instructions when compiling with -mavx and -mfma

What you want to do is compile a separate object file for each instruction set you are targeting. Then create a CPU dispatcher which asks CPUID for the available instruction sets and jumps to the appropriate version of the function. I have already described this in several different questions and answers:

- Disable AVX2 functions on non-Haswell processors
- Do I need to make multiple executables for targetting different instruction set?
- How to check with Intel intrinsics if AVX extensions is supported by the CPU?
- CPU dispatcher for Visual Studio for AVX and SSE
- Create separate object files from the same source code and link to an executable

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Yes, it’s possible. But as of AVX2, it’s unlikely to be better than the scalar approaches with MULX/ADCX/ADOX. There’s a virtually unlimited number of variations of this approach for different input/output domains. I’ll only cover three of them, but they are easy to generalize once you know how they work. Disclaimer: all solutions here assume …

Optimize for fast multiplication but slow addition: FMA and doubledouble

To answer my third question, I found a faster solution for double-double addition: an alternative definition given in the paper Implementation of float-float operators on graphics hardware. Theorem 5 (Add22 theorem): let ah + al and bh + bl be the float-float arguments of the following algorithm: Add22(ah, al, bh, bl) 1 r = ah ⊕ bh …

Fused multiply add and default rounding modes

It doesn’t violate IEEE-754, because IEEE-754 defers to language standards on this point: “A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to: … ― Synthesis of a fusedMultiplyAdd operation from a multiplication …”