clang 14.0.0 floating point optimizations

This is a pretty deep rabbit hole, and I don’t know if I’ve explored all of its twists and turns yet. But here’s a first draft of an answer; suggestions for improvement are welcome.

At its core, the culprit is the so-called “fused multiply-add” (or, in this case, a fused multiply-subtract). A fused multiply-add is a single instruction which computes a*b+c in one step. This can significantly speed up certain computations (such as dot products and polynomial evaluation using Horner’s rule). It was added to Intel’s x86 instruction set in about 2013 (with Haswell); a similar instruction was added to AMD chips a year earlier. But the idea is not new; high-end processors have included such instructions at least since 1990 (with IBM’s POWER1 processor).
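
To make the Horner’s rule case concrete, here’s a minimal sketch (the function name and signature are mine, for illustration) using the standard fma function from <math.h>; it performs exactly one fused multiply-add per coefficient:

#include <math.h>

/* Evaluate c[n]*x^n + ... + c[1]*x + c[0] by Horner's rule,
   using one fused multiply-add per coefficient. */
double horner(const double *c, int n, double x) {
    double r = c[n];
    for (int i = n - 1; i >= 0; --i)
        r = fma(r, x, c[i]);    /* r = r*x + c[i], rounded once */
    return r;
}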

Because the result of the fused operation is rounded only once (instead of twice: once after the multiply and again after the add), it often produces more accurate results. Unfortunately, there are cases where it produces less accurate results, and this is one of them; it’s triggered by the computation of a*b-c where a*b and c are very similar, and c was previously rounded. [Note 1] To see the problem in action, it’s useful to reduce the code to a minimal example, whose result is at least surprising:

#include <stdio.h>
int main (void) {
    double A = 373737.0;
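    /* Mathematically, this difference is exactly zero. */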
    printf("A*A*A - A*A*A is %f.\n", A*A*A - A*A*A);
    return 0;
}

With Clang since v14.0.0, that prints out 1.000000. [Note 2] The result is 1 (rather than -1) because the expression A*A*A - A*A*A is turned into a fused multiply-subtract of A*A, A, and A*A*A. Now, 373737³ is exactly 52203339425426553, a 56-bit number. Since double on an x86 platform allows only 53 significant bits, that needs to be rounded to the nearest representable value, which is 52203339425426552. In the fused operation, 373737² * 373737 is computed exactly, and then the rounded value of 373737³ is subtracted, leaving 1.
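
You can reproduce the fused computation by hand with the standard fma function from <math.h>, which always rounds exactly once (a minimal sketch; the result here does not depend on -ffp-contract, because the fusion is explicit):

#include <math.h>
#include <stdio.h>

int main(void) {
    double A = 373737.0;
    /* fma(x, y, z) computes x*y + z with a single rounding at the end.
       The product A*A*A is formed exactly inside the fma; the
       previously-rounded value of A*A*A is then subtracted from it. */
    printf("%f\n", fma(A*A, A, -(A*A*A)));    /* prints 1.000000 */
    return 0;
}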

In the original program, the computation was (approximately) 373737³ + 1e-6 - 373737³ - 1e-6. In this computation, 373737³ + 1e-6 is first computed (using FMA) and rounded, which again yields 52203339425426552; adding 1e-6 has no effect on the rounded sum. Then a fused negated multiply-add is performed, adding 52203339425426552 and the exact negated product of 373737² and 373737 (-52203339425426553); the result is exactly -1. Finally, 1e-6 is subtracted, leading to the observed result of -1.000001.
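
For reference, here’s my reconstruction of the shape of that computation (not the OP’s actual code); with Clang v14 or later and optimisation enabled, it prints -1.000001:

#include <stdio.h>

int main(void) {
    double A = 373737.0;
    double eps = 1e-6;
    /* Mathematically zero; with FMA contraction this evaluates
       to -1.000001, as analysed above. */
    printf("%f\n", A*A*A + eps - A*A*A - eps);
    return 0;
}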

That’s the essence of what Goldberg calls “catastrophic cancellation” (see Note 1 if you haven’t already read his essay); the subtraction of two very similar values cancels out all significance.

(On the other hand, with some care you can use the fact that the multiplication in the fused operation was not rounded in order to produce a more accurate final result, using an algorithm due to the Canadian mathematician William Kahan, primary architect of the IEEE-754 standards. See, for example, this enlightening answer by @njuffa on how to accurately compute quadratic roots when b² is close to 4ac.)
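
For the curious, here’s a sketch of that idea applied to the discriminant b²-4ac (my paraphrase of the algorithm usually attributed to Kahan, not @njuffa’s exact code). The fma calls recover the rounding errors of the two products, which would otherwise be lost when the products nearly cancel:

#include <math.h>

/* Compute b*b - 4*a*c, compensating for the rounding of the two
   products when they nearly cancel. */
double discriminant(double a, double b, double c) {
    double p = b * b;
    double q = 4.0 * a * c;
    if (3.0 * fabs(p - q) >= p + q)   /* no severe cancellation */
        return p - q;
    double dp = fma(b, b, -p);        /* rounding error of b*b */
    double dq = fma(4.0 * a, c, -q);  /* rounding error of 4*a*c */
    return (p - q) + (dp - dq);       /* exact difference + correction */
}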

So what changed with Clang v14.0.0? Both Clang and GCC have an option which controls whether FMA is used: -ffp-contract. (In the C standard, FMA is one of the examples of “contracted operations”, and this option controls all such operations.) That option has three possible values: off, on and fast. off always means that the compiler will not fuse multiplies and adds when compiling expressions. (It will still compile the fma function into an FMA opcode, if that opcode is available on the target machine.) Up to and including v13, off was the default for Clang; with v14.0.0, the default was changed to on, which allows fusing a multiply and an add within the same expression. Since then, Clang will, by default, emit FMA instructions if the target architecture implements them. More relevantly for this question, it will also emulate FMA for constant computations performed at compile time.
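
For example (a hypothetical session on x86-64, with the minimal program above saved as demo.c; the outputs follow from the analysis earlier in this answer):

$ clang -O2 demo.c -o demo && ./demo
A*A*A - A*A*A is 1.000000.
$ clang -O2 -ffp-contract=off demo.c -o demo && ./demo
A*A*A - A*A*A is 0.000000.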

Although GCC has the same option, the semantics are somewhat different. As far as I know, GCC does not emulate FMA for compile-time computations. Furthermore, GCC interprets -ffp-contract=on as being the same as -ffp-contract=off (!), and its default is -ffp-contract=fast. The fast setting allows contracted operations not only within expressions (which is allowed by standard C) but also in computations which span different expressions. However, for this particular computation, GCC’s optimizer prefers to save and reuse the value of the common subexpression A*A*A, rather than emitting an FMA. [Note 3]

Clang also allows -ffp-contract=fast, with approximately the same semantics as GCC’s, but, perhaps surprisingly, specifying that option prevents the constant folder from emulating FMA. [Note 4]

The C standard actually defines a portable mechanism to control the use of contracted operations: the #pragma STDC FP_CONTRACT, with possible values ON, OFF and DEFAULT. OFF is required to suppress emission of FMA operations, but the standard places no other restriction; the default can be either ON or OFF, and ON is not required to do anything in particular. However, GCC does not implement this pragma (as of GCC v12), so it’s not as portable as one might wish. (Clang does implement it, though.)
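
With Clang, it looks like this (a sketch; placed at file scope, the pragma applies to the whole translation unit):

#include <stdio.h>

#pragma STDC FP_CONTRACT OFF    /* forbid contracted operations in this file */

int main(void) {
    double A = 373737.0;
    /* With contraction disabled, both A*A*A terms round identically,
       so this prints 0.000000 even under Clang >= 14 with optimisation. */
    printf("A*A*A - A*A*A is %f.\n", A*A*A - A*A*A);
    return 0;
}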

Although, as this question shows, the use of fused multiply-add can produce surprising results, and it’s easy to fall into the trap of assuming that such results are compiler bugs, it’s pretty clear that the standard intends compilers to be free to use FMA and other contracted operations, as long as there is a way of turning the feature off. That’s indicated in §6.5 paragraph 8, whose wording has not changed since C99:

A floating expression may be contracted, that is, evaluated as though it were a single operation, thereby omitting rounding errors implied by the source code and the expression evaluation method. The FP_CONTRACT pragma in <math.h> provides a way to disallow contracted expressions. Otherwise, whether and how expressions are contracted is implementation-defined.

The clause is accompanied by this footnote:

This license is specifically intended to allow implementations to exploit fast machine instructions that combine multiple C operators. As contractions potentially undermine predictability, and can even decrease accuracy for containing expressions, their use needs to be well-defined and clearly documented.

It has been argued that the requirements in Annex F for IEC 60559 compliance (usually described as IEEE-754/854) override the licence explicitly granted above, but I don’t find this argument convincing. First, §6.5, as cited above, is pretty clear. Second, Annex F itself contemplates contracted expressions, in §F.7:

A contracted expression is correctly rounded (once) and treats infinities, NaNs, signed zeros, subnormals, and the rounding directions in a manner consistent with the basic arithmetic operations covered by IEC 60559.

Third, IEEE-754 (the 2008 revision; see Note 5) is explicit in allowing implementations to provide contracted operations, as long as they also provide a way to turn them off:

A language standard should require that by default, when no optimizations are enabled and no alternate exception handling is enabled, language implementations preserve the literal meaning of the source code.

A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to:

  • Applying the associative or distributive laws.
  • Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition.

I say all that with a certain amount of pain, since I too was pretty sure that this behaviour was buggy. The unpredictable application of FMA seems less than ideal. On the other hand, the standard defines the fma function, which should (and normally does) get compiled inline into the appropriate machine instruction, and there are mechanisms to require compilers not to emit contracted operations unless explicitly asked to; I’m definitely going to consider using those more consistently.

Notes

  1. This is the scenario described as “catastrophic cancellation” by David Goldberg in the essay What Every Computer Scientist Should Know About Floating-Point Arithmetic, which is inevitably cited in any discussion of floating-point quirks. By “cancellation”, Goldberg means that significant digits are cancelled out by the subtraction, potentially leaving only digits within the error bounds.

  2. At least, if you have the right compiler options specified. With default compiler options, you’ll get 0.

    As noted in the OP, the odd result doesn’t happen with the default compiler settings. That’s because the default is no optimisations. With any optimisation enabled, Clang will fold constant expressions at compile time, and the constant folder emulates fused multiply-add. Without optimisation, the computation is done at run-time, and by default, Clang doesn’t emit FMA instructions because they are not available on all supported x86 chips. You need to specify -mfma (or some other similar target selector) to indicate that the target architecture includes the FMA instruction set in order to see FMA instructions in the compiled binary.

  3. I don’t know whether GCC’s constant folder emulates FMA; if I figure that out later, I’ll edit this paragraph.

  4. The reason that -ffp-contract=fast suppresses FMA in the constant folder is explained by LLVM committer Andy Kaylor in a comment to bug 54927.

  5. I don’t have a copy of the later versions, but I suspect the essence hasn’t changed.
