How to perform round to even with floating point numbers

Just to make sure we’re on the same page: G (the guard bit) is the most significant of the three bits, R (the round bit) comes next, and S (the sticky bit) is the least significant. S is set whenever any of the even less significant bits that were truncated in the calculations is 1, so it stands in for all of them. These three bits are only used while doing calculations and aren’t stored in the floating-point variable before or after the calculations.
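
As a minimal sketch of how the three bits could be pulled out of a wider intermediate result, the Python below treats the result as a plain integer. The function name `split_grs` and its parameters are made up for illustration, and it assumes at least two low bits are being dropped.

```python
def split_grs(value: int, dropped_bits: int):
    """Split the low `dropped_bits` bits of `value` into G, R and S.

    Hypothetical helper: assumes value >= 0 and dropped_bits >= 2.
    Returns (kept_mantissa, g, r, s).
    """
    kept = value >> dropped_bits                 # the bits that will be stored
    g = (value >> (dropped_bits - 1)) & 1        # first dropped bit
    r = (value >> (dropped_bits - 2)) & 1        # second dropped bit
    rest_mask = (1 << (dropped_bits - 2)) - 1    # everything below R
    s = 1 if value & rest_mask else 0            # sticky: OR of the rest
    return kept, g, r, s
```

For example, `split_grs(0b101110001, 5)` keeps `0b1011` and reports G=1, R=0, S=1, because one of the three bits below R is set.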

This is what you should do in order to round the result to the nearest representable value, with ties going to the even one, using G, R and S:

GRS – Action
0xx – round down = do nothing (x means any bit value, 0 or 1)
100 – this is a tie: round up if the mantissa’s bit just before G is 1, else round down = do nothing
101 – round up
110 – round up
111 – round up
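
The table above can be sketched as a small decision function. This is only an illustration: the name `should_round_up` and its argument order are assumptions, and `lsb` stands for the mantissa’s bit just before G.

```python
def should_round_up(lsb: int, g: int, r: int, s: int) -> bool:
    """Decide whether round-to-nearest-even adds 1 to the mantissa.

    Hypothetical helper: every argument is a single bit (0 or 1).
    """
    if g == 0:
        return False       # 0xx: below the halfway point, round down
    if r == 1 or s == 1:
        return True        # 101, 110, 111: above the halfway point
    return lsb == 1        # 100: tie, round up only if the mantissa is odd
```

Only the tie case looks at `lsb`; in every other row the G, R and S bits decide on their own.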

Rounding up is done by adding 1 to the mantissa at its least significant bit position, the one just before G. If the mantissa overflows (in single precision, the 23 least significant bits that you will store all become zeroes), you have to add 1 to the exponent. If the exponent overflows, you set the number to +infinity or -infinity, depending on the number’s sign.
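
Here is one way the round-up step might look for a single-precision number that has already been split into its fields. It is a sketch under assumptions: the function name `round_up` and the constants are invented for the example, the input is taken to be finite, `exponent` is the 8-bit biased exponent field, and `fraction` is the 23 stored mantissa bits.

```python
FRACTION_BITS = 23     # stored mantissa bits in single precision
EXPONENT_ALL_ONES = 0xFF   # biased exponent value that encodes infinity

def round_up(sign: int, exponent: int, fraction: int):
    """Add 1 to the stored mantissa, handling both kinds of overflow.

    Hypothetical helper: assumes a finite input (exponent < 0xFF).
    Returns the new (sign, exponent, fraction) fields.
    """
    fraction += 1
    if fraction >> FRACTION_BITS:        # carry out of the 23 stored bits
        fraction = 0                     # stored bits become all zeroes
        exponent += 1                    # so bump the exponent
    if exponent >= EXPONENT_ALL_ONES:    # exponent overflow
        exponent = EXPONENT_ALL_ONES     # infinity: all-ones exponent,
        fraction = 0                     # zero fraction, sign unchanged
    return sign, exponent, fraction
```

With these field conventions, `round_up(0, 127, 0x7FFFFF)` gives `(0, 128, 0)`, and `round_up(1, 254, 0x7FFFFF)` gives `(1, 255, 0)`, which is the encoding of -infinity.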

In the case of a tie, you add 1 to the mantissa if the mantissa is odd and you add nothing if it’s even. Either way the resulting mantissa is even, and that’s what makes the result rounded to the nearest even value.
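
To see the tie rule in action, the sketch below rounds a few small integers whose three lowest bits play the role of G, R and S. The function name `round_nearest_even` is an assumption, and overflow of the mantissa is deliberately left out to keep the example short.

```python
def round_nearest_even(extended: int) -> int:
    """Drop the three low bits (G, R, S) of `extended`, ties to even.

    Hypothetical helper: assumes extended >= 0 and ignores overflow.
    """
    mantissa = extended >> 3
    grs = extended & 0b111
    if grs > 0b100 or (grs == 0b100 and mantissa & 1):
        mantissa += 1      # above halfway, or a tie with an odd mantissa
    return mantissa

# Four consecutive mantissas (8, 9, 10, 11), each followed by GRS = 100.
ties = [round_nearest_even((m << 3) | 0b100) for m in range(8, 12)]
print(ties)  # [8, 10, 10, 12]
```

Every result is even: the ties alternate between rounding down and rounding up, so the rounding errors tend to cancel instead of drifting in one direction.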
