Efficient implementation of log2(__m256d) in AVX2

The usual strategy is based on the identity log(a*b) = log(a) + log(b), or in this case log2(2^exponent * mantissa) = log2(2^exponent) + log2(mantissa), which simplifies to exponent + log2(mantissa). The mantissa has a very limited range, 1.0 to 2.0, so a polynomial for log2(mantissa) only has to fit over that range. (Or equivalently, mantissa = 0.5 to 1.0, and change the exponent bias-correction constant by 1.)
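As a scalar illustration of that identity (one lane of what a vector version would do), here is a minimal sketch. The function name log2_via_split is made up for this example, and std::log2 on the mantissa is only a stand-in for the polynomial you would actually fit.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: split x into exponent and mantissa, then sum the two logs.
double log2_via_split(double x) {
    assert(x > 0.0 && std::isfinite(x));  // sketch only covers positive finite inputs
    int e = 0;
    double m = std::frexp(x, &e);  // x == m * 2^e with 0.5 <= m < 1.0
    m *= 2.0;                      // move the mantissa to [1.0, 2.0) ...
    e -= 1;                        // ... and compensate in the exponent
    // std::log2(m) stands in for the polynomial approximation over [1.0, 2.0)
    return static_cast<double>(e) + std::log2(m);
}
```

Calling log2_via_split(8.0) goes through e = 3, m = 1.0, so the polynomial stand-in only ever sees values in [1.0, 2.0).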

A Taylor series expansion is a good starting point for the coefficients, but you usually want to minimize the max-absolute-error (or relative error) over that specific range, and Taylor series coefficients will likely leave a lower or higher outlier over that range, rather than having the max positive error nearly match the max negative error. So you can do what’s called a minimax fit of the coefficients.

If it’s important that your function evaluates log2(1.0) to exactly 0.0, you can arrange for that to happen by actually using mantissa-1.0 as the input to your polynomial, with no constant coefficient: 0.0 ^ n = 0.0. This also greatly improves the relative error for inputs near 1.0, where the result is close to zero and a small absolute error can still be a large relative error.
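A minimal scalar sketch of the no-constant-term idea follows. The name log2_near_one is hypothetical, and the coefficients are plain Taylor terms of ln(1+t) scaled by 1/ln(2), used here only as placeholders: they are not a minimax fit, and accuracy drops off as the input moves away from 1.0.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: approximate log2(m) for m close to 1.0.
// Placeholder Taylor coefficients, NOT minimax-fitted.
double log2_near_one(double m) {
    assert(m > 0.5 && m < 2.0);
    const double inv_ln2 = 1.4426950408889634;  // 1 / ln(2)
    const double t = m - 1.0;                   // exact for m in [0.5, 2.0]
    // Horner evaluation of 1 - t/2 + t^2/3 - t^3/4 + t^4/5 - t^5/6
    double p = -1.0 / 6.0;
    p = p * t + 1.0 / 5.0;
    p = p * t - 1.0 / 4.0;
    p = p * t + 1.0 / 3.0;
    p = p * t - 1.0 / 2.0;
    p = p * t + 1.0;
    // Multiplying by t last means there is no constant term: t == 0 gives exactly 0.0
    return (p * t) * inv_ln2;
}
```

Because the final multiply by t is the only place the input's magnitude enters, tiny t values keep their full relative precision instead of being swamped by a constant coefficient.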

How accurate do you need it to be, and over what range of inputs? As usual there’s a tradeoff between accuracy and speed, but fortunately it’s pretty easy to move along that scale by e.g. adding one more polynomial term (and re-fitting the coefficients), or by dropping some rounding-error avoidance.

Agner Fog’s VCL implementation of log_d() aims for very high accuracy, using tricks that avoid rounding error, such as not adding a small number to a large one where possible. This obscures the basic design somewhat.

For a faster, more approximate float log(), see the polynomial implementation on http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html. It leaves out a LOT of the extra precision-gaining tricks that VCL uses, so it’s easier to understand. It uses a polynomial approximation for the mantissa over the 1.0 to 2.0 range.

(That’s the real trick to log() implementations: you only need a polynomial that works over a small range.)

It already just does log2 instead of log, unlike VCL’s where the log-base-e is baked in to the constants and how it uses them. Reading it is probably a good starting point for understanding exponent + polynomial(mantissa) implementations of log().

Even the highest-precision version of it is not full float precision, let alone double, but you could fit a polynomial with more terms. Or apparently a ratio of two polynomials works well; that’s what VCL uses for double.

I got excellent results from porting JRF’s SSE2 function to AVX2 + FMA (and especially AVX512 with _mm512_getexp_ps and _mm512_getmant_ps), once I tuned it carefully. (It was part of a commercial project, so I don’t think I can post the code.) A fast approximate implementation for float was exactly what I wanted.

In my use-case, each jrf_fastlog() was independent, so OOO execution nicely hid the FMA latency, and it wasn’t even worth using the higher-ILP shorter-latency polynomial evaluation method that VCL’s polynomial_5() function uses (“Estrin’s scheme”, which does some non-FMA multiplies before the FMAs, resulting in more total instructions).
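To make the Horner vs. Estrin tradeoff concrete, here is a scalar sketch for a 5th-order polynomial. The function names are made up for this example, std::fma stands in for a vector FMA, and the coefficient layout (c[0] is the constant term) is an assumption of the sketch.

```cpp
#include <cmath>

// Horner: 5 FMAs, each depending on the previous one (one long dependency chain).
double poly5_horner(double x, const double c[6]) {
    double r = c[5];
    for (int i = 4; i >= 0; --i) {
        r = std::fma(r, x, c[i]);
    }
    return r;
}

// Estrin: two plain multiplies plus 5 FMAs, but the three inner FMAs
// and the x2/x4 multiplies are independent, so the critical path is shorter.
double poly5_estrin(double x, const double c[6]) {
    const double x2 = x * x;
    const double x4 = x2 * x2;
    const double lo  = std::fma(c[1], x, c[0]);  // c0 + c1*x
    const double mid = std::fma(c[3], x, c[2]);  // c2 + c3*x
    const double hi  = std::fma(c[5], x, c[4]);  // c4 + c5*x
    return std::fma(mid, x2, std::fma(hi, x4, lo));
}
```

When many independent evaluations are in flight, out-of-order execution overlaps the Horner chains anyway, so the fewer-instruction version tends to win on throughput.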

Agner Fog’s VCL is now Apache-licensed, so any project can just include it directly. If you want high accuracy, you should just use VCL directly. It’s header-only, just inline functions, so it won’t bloat your binary.

VCL’s log float and double functions are in vectormath_exp.h. There are two main parts to the algorithm:

extract the exponent bits and convert that integer back into a float (after adjusting for the bias that IEEE FP uses).
extract the mantissa and OR in some exponent bits to get a vector of double values in the [0.5, 1.0) range. (Or (0.5, 1.0], I forget).

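Further adjust this with if(mantissa <= SQRT2*0.5) { mantissa += mantissa; } else { exponent++; }, and then mantissa -= 1.0. (Exactly one of the two corrections applies to each element: either the mantissa is doubled or the exponent is incremented, so the value they represent together is unchanged and the mantissa ends up centered around 1.0.)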
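The extraction and adjustment steps can be sketched for a single double like this. It is a scalar stand-in for the vector bit operations, with hypothetical names (SplitDouble, split_for_log), and it only handles normal positive inputs. Here the exponent is extracted as floor(log2(x)), so the [0.5, 1.0) mantissa needs either doubling or an exponent increment, never both.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

struct SplitDouble {  // hypothetical name
    double fe;        // exponent, converted to double
    double t;         // adjusted mantissa minus 1.0
};

SplitDouble split_for_log(double x) {
    assert(std::isnormal(x) && x > 0.0);  // no zero, subnormal, inf, NaN, or negatives
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // Exponent field is bits 52..62; subtract the IEEE bias of 1023.
    int e = static_cast<int>((bits >> 52) & 0x7FF) - 1023;
    // Keep the 52 fraction bits and OR in the exponent field of 0.5 (0x3FE).
    const std::uint64_t mbits =
        (bits & 0x000FFFFFFFFFFFFFULL) | 0x3FE0000000000000ULL;
    double m;
    std::memcpy(&m, &mbits, sizeof m);    // m is now in [0.5, 1.0)
    if (m <= 0.70710678118654752440) {
        m += m;                           // m in [1.0, sqrt(2)], exponent already right
    } else {
        e += 1;                           // m stays in (sqrt(0.5), 1.0)
    }
    return SplitDouble{static_cast<double>(e), m - 1.0};
}
```

After this, x equals (1.0 + t) * 2^fe with t roughly in (-0.29, 0.41], which is the small interval around zero that the polynomial has to cover.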

Use a polynomial approximation to log(x) that is accurate around x=1.0. For double, VCL’s log_d() uses a ratio of two 5th-order polynomials; @harold says this is often good for precision. One division mixed in with a lot of FMAs doesn’t usually hurt throughput, but it does have higher latency than an FMA. Using vrcpps + a Newton-Raphson iteration is typically slower than just using vdivps on modern hardware. Using a ratio also creates more ILP by evaluating two lower-order polynomials in parallel, instead of one high-order polynomial, and may lower overall latency vs. one long dep chain for a high-order polynomial (which would also accumulate significant rounding error along that one long chain).

Then add exponent + polynomial_approx_log(mantissa) to get the final log() result. VCL does this in multiple steps to reduce rounding error. ln2_lo + ln2_hi = ln(2). It’s split up into a small and a large constant to reduce rounding error.

// res is the polynomial(adjusted_mantissa) result
// fe is the float exponent
// x is the adjusted_mantissa.  x2 = x*x;
res  = mul_add(fe, ln2_lo, res);             // res += fe * ln2_lo;
res += nmul_add(x2, 0.5, x);                 // res += x  - 0.5 * x2;
res  = mul_add(fe, ln2_hi, res);             // res += fe * ln2_hi;
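A scalar sketch of the same three steps, using std::fma in place of the vector mul_add / nmul_add helpers. The function name add_exponent_term is hypothetical, and the constants are my assumption of a typical split: ln2_hi = 355/512 has only a few significant bits, so fe * ln2_hi is exact for any realistic exponent, and ln2_lo is the small remainder ln(2) - ln2_hi.

```cpp
#include <cmath>

// Hypothetical helper: combine polynomial result, x - 0.5*x^2, and fe*ln(2).
// res is the polynomial(adjusted_mantissa) result, x the adjusted mantissa,
// fe the exponent as a floating-point value.
double add_exponent_term(double res, double x, double fe) {
    const double ln2_hi = 0.693359375;                  // 355/512, short bit pattern (assumed split)
    const double ln2_lo = -2.121944400546905827679e-4;  // ln(2) - ln2_hi
    const double x2 = x * x;
    res = std::fma(fe, ln2_lo, res);   // add the small part of fe*ln(2) first
    res += std::fma(-0.5, x2, x);      // res += x - 0.5 * x2
    res = std::fma(fe, ln2_hi, res);   // add the large, exactly-representable part last
    return res;
}
```

Adding the small pieces together before the large one keeps their low bits from being rounded away early.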

You can drop the 2-step ln2 stuff and just use VM_LN2 if you aren’t aiming for 0.5 or 1 ulp accuracy (or whatever this function actually provides; IDK.)

The x - 0.5*x2 part is really an extra polynomial term, I guess. This is what I meant by log base e being baked-in: you’d need a coefficient on those terms, or to get rid of that line and re-fit the polynomial coefficients for log2. You can’t just multiply all the polynomial coefficients by a constant.

After that, it checks for underflow, overflow or denormal, and branches if any element in the vector needs special processing to produce a proper NaN or -Inf rather than whatever garbage we got from the polynomial + exponent. If your values are known to be finite and positive, you can comment out this part and get a significant speedup (even the checking before the branch takes several instructions).
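The check-then-branch pattern looks roughly like this in scalar form. This is not VCL's code: patch_special_cases is a made-up name, a plain array stands in for the vector, and falling back to std::log for the unusual elements is just a simple way to get the standard -Inf / NaN / +Inf results.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: res[] holds fast-path results for inputs x[].
// Only elements that are not normal positive numbers get recomputed.
void patch_special_cases(const double* x, double* res, std::size_t n) {
    bool any_special = false;
    for (std::size_t i = 0; i < n; ++i) {
        // zero, negative, subnormal, infinity, and NaN all fail this test
        any_special |= !(std::isnormal(x[i]) && x[i] > 0.0);
    }
    if (!any_special) {
        return;  // common case: one cheap check, nothing to fix
    }
    for (std::size_t i = 0; i < n; ++i) {
        if (!(std::isnormal(x[i]) && x[i] > 0.0)) {
            res[i] = std::log(x[i]);  // library gives -Inf for 0, NaN for negatives, etc.
        }
    }
}
```

If the inputs are guaranteed finite and positive, this whole function (including the first loop) is the part you would drop.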
