Preventing GCC from automatically using AVX and FMA instructions when compiling with -mavx and -mfma

What you want to do is compile a separate object file for each instruction set you are targeting. Then create a CPU dispatcher which asks CPUID for the available instruction sets and jumps to the appropriate version of the function. I have already described this in several different questions and answers:

- Disable AVX2 functions on non-Haswell processors
- Do I need to make multiple executables for targetting different instruction set?
- How to check with Intel intrinsics if AVX extensions is supported by the CPU?
- CPU dispatcher for Visual Studio for AVX and SSE
- Create separate object files from the same source code and link to an executable

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Yes, it’s possible. But as of AVX2, it’s unlikely to be better than the scalar approaches with MULX/ADCX/ADOX. There’s a virtually unlimited number of variations of this approach for different input/output domains. I’ll only cover three of them, but they are easy to generalize once you know how they work. Disclaimer: all solutions here assume …

Optimize for fast multiplication but slow addition: FMA and doubledouble

To answer my third question, I found a faster solution for double-double addition: an alternative definition given in the paper Implementation of float-float operators on graphics hardware. Theorem 5 (Add22 theorem): let ah + al and bh + bl be the float-float arguments of the following algorithm: Add22(ah, al, bh, bl) 1 r = ah ⊕ bh …

Fused multiply add and default rounding modes

It doesn’t violate IEEE-754, because IEEE-754 defers to language standards on this point: “A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to: … ― Synthesis of a fusedMultiplyAdd operation from a multiplication …”