Efficient way to round double precision numbers to a lower precision given in number of bits

Dekker’s algorithm (its splitting step, often credited to Veltkamp) will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary) and you want to drop the low b bits, then *x0 receives the high s−b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with the hexadecimal floating-point literal 0x1p43. Otherwise, you must compute 2^b at run time, for example with ldexp(1.0, b).
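When b is only known at run time, one way to build Scale is sketched below. MakeScale is a hypothetical helper name, not part of the original answer, and the range check assumes IEEE 754 binary64 (53-bit significand).

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: returns 2^b, where b is the number of low
   significand bits to discard. Assumes a 53-bit significand. */
static double MakeScale(int b)
{
    assert(0 < b && b < 53);   /* must drop at least 1 bit and keep at least 1 */
    return ldexp(1.0, b);      /* 1.0 * 2^b, computed exactly */
}
```

For b = 43 this returns the same value as the literal 0x1p43.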

This requires the round-to-nearest rounding mode. IEEE 754 arithmetic suffices, though other reasonable arithmetic may work too. The result is rounded with ties to even, which is not what you requested (ties upward). Is rounding ties upward necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision, not in a wider format such as x87 extended precision (in C terms, FLT_EVAL_METHOD should be 0).
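If very large inputs are possible, the overflow assumption can be checked up front. The following is a minimal, deliberately conservative sketch; SplitIsSafe is a hypothetical name, and the bound assumes IEEE 754 binary64. It rejects some inputs that would in fact be fine, in exchange for a simple argument: if |x| < 2^(1023−b), then |x| * (2^b + 1) < 2^1023 + 2^1022, which is below DBL_MAX.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical guard: nonzero if x * (2^b + 1) is certain to stay finite.
   Conservative: requires |x| < 2^(DBL_MAX_EXP - 1 - b).
   NaN and infinities are reported as unsafe. */
static int SplitIsSafe(double x, int b)
{
    assert(0 < b && b < 53);   /* assumes a 53-bit significand */
    return fabs(x) < ldexp(1.0, DBL_MAX_EXP - 1 - b);
}
```

Inputs that fail the check could be scaled down by a power of two first, split, and scaled back, since scaling by powers of two is exact away from the subnormal range.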

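/* Scale must equal 2^b; b = 43 is only the example value from the text. */
static const double Scale = 0x1p43;

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);   /* x * 2^b + x, rounded to double */
    double t = d - x;             /* approximately x * 2^b */
    *x0 = d - t;                  /* x rounded to its high s-b bits */
    *x1 = x - *x0;                /* remaining low part; may be discarded */
}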
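If only the rounded value is needed and b varies at run time, the same three operations can be wrapped as below. This is a sketch under the assumptions already stated (round-to-nearest, evaluation in plain double precision, no overflow); RoundDropBits is a hypothetical name, not from the original answer.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical wrapper: rounds x to its high (53 - b) significand bits,
   i.e. drops the low b bits. Assumes IEEE 754 binary64, round-to-nearest,
   and that x * (2^b + 1) does not overflow. */
static double RoundDropBits(double x, int b)
{
    assert(0 < b && b < 53);
    double scale = ldexp(1.0, b);   /* 2^b, exact */
    double d = x * (scale + 1);     /* x * 2^b + x, rounded to double */
    double t = d - x;               /* approximately x * 2^b */
    return d - t;                   /* high part of x; low part is dropped */
}
```

For instance, RoundDropBits(0x1.921fb54442d18p+1, 43) keeps 10 significand bits of the double nearest to pi and yields 3.140625.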
