## Maximum number of decimal digits that can affect a double

When you have a subnormal number with odd significand, that is, an odd multiple of 2^(-1074), you have a number whose last nonzero digit in the decimal representation is the 1074th after the decimal point. You then have around 300 zeros directly following the decimal point, so the number has around 750-770 significant decimal digits. …
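The digit counts above can be checked directly. The following is a minimal sketch, assuming that Rust's `{:.N}` float formatting prints exact decimal digits; the helper names `exact_decimal` and `digit_counts` are made up for this illustration.

```rust
/// Decimal expansion of `x` with 1074 fractional digits, which is enough
/// to write any finite f64 exactly (2^-1074 is the smallest unit).
fn exact_decimal(x: f64) -> String {
    format!("{:.1074}", x)
}

/// For 0 < x < 1, returns (zeros directly after the decimal point,
/// number of significant decimal digits).
fn digit_counts(x: f64) -> (usize, usize) {
    let s = exact_decimal(x);
    // Drop the "0." prefix and any trailing zeros of the fraction.
    let frac = s.strip_prefix("0.").expect("expected 0 < x < 1");
    let frac = frac.trim_end_matches('0');
    let significant = frac.trim_start_matches('0');
    (frac.len() - significant.len(), significant.len())
}

fn main() {
    // Smallest subnormal: 1 * 2^-1074.
    let smallest = f64::from_bits(1);
    // Largest subnormal: (2^52 - 1) * 2^-1074, also an odd multiple.
    let largest = f64::from_bits((1u64 << 52) - 1);

    println!("{:?}", digit_counts(smallest)); // (323, 751)
    println!("{:?}", digit_counts(largest)); // (307, 767)
}
```

Both values are odd multiples of 2^(-1074), so their last nonzero digit is the 1074th after the decimal point, and the two counts in each pair add up to 1074.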

## Having parameter (constant) variable with NaN value in Fortran

To add to Vladimir F’s answer, I’ll mention that gfortran 5.0 (but not earlier) supports the IEEE intrinsic modules. Instead of `real x; x=0; x=0/x` one can use `use, intrinsic :: iso_fortran_env; use, intrinsic :: ieee_arithmetic; integer(int32) i; real(real32) x; x = ieee_value(x, ieee_quiet_nan); i = transfer(x,i)`. This gives you a little flexibility over which …

## What uses do floating point NaN payloads have?

It was thought to be a good idea when IEEE 754 and NaNs were developed. I have actually seen it used to store the reason why a NaN was created. Today, I wouldn’t use it in portable code for several reasons. How sure are you that this payload will survive, for example, an assignment? If you …
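To make the idea of a payload concrete, here is a minimal sketch that stores a small "reason code" in the low 51 bits of a quiet NaN and reads it back. The helper names and the use of 42 as a reason code are made up for this illustration, and the round trip only inspects the bits immediately; as the answer warns, nothing here guarantees the payload survives arithmetic or every platform.

```rust
// Exponent all ones plus the quiet bit: the canonical quiet NaN pattern.
const QUIET_NAN_BITS: u64 = 0x7ff8_0000_0000_0000;
// The 51 bits below the quiet bit are free to carry a payload.
const PAYLOAD_MASK: u64 = 0x0007_ffff_ffff_ffff;

/// Builds a quiet NaN carrying `payload` (truncated to 51 bits).
fn nan_with_payload(payload: u64) -> f64 {
    f64::from_bits(QUIET_NAN_BITS | (payload & PAYLOAD_MASK))
}

/// Returns the payload if `x` is a NaN, otherwise `None`.
fn payload_of(x: f64) -> Option<u64> {
    if x.is_nan() {
        Some(x.to_bits() & PAYLOAD_MASK)
    } else {
        None
    }
}

fn main() {
    let tagged = nan_with_payload(42);
    println!("is NaN: {}", tagged.is_nan());
    println!("payload: {:?}", payload_of(tagged));
    println!("payload of 1.0: {:?}", payload_of(1.0));
}
```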

## Next higher/lower IEEE double precision number

There are functions available for doing exactly that, but they can depend on what language you use. Two examples: if you have access to a decent C99 math library, you can use nextafter (and its float and long double variants, nextafterf and nextafterl); or the nexttoward family (which take a long double as second argument). …
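Where no such library function is at hand, the neighbouring double can also be found by stepping the bit pattern, because IEEE 754 doubles of the same sign are ordered like their integer representations. The sketch below is an illustration of that idea in Rust, not the C99 `nextafter` itself; the names `next_above` and `next_below` are made up here.

```rust
/// Smallest f64 strictly greater than `x` (NaN and +infinity are returned as-is).
fn next_above(x: f64) -> f64 {
    if x.is_nan() || x == f64::INFINITY {
        return x;
    }
    if x == 0.0 {
        // Covers both +0.0 and -0.0: the next value up is the smallest subnormal.
        return f64::from_bits(1);
    }
    let bits = x.to_bits();
    if x > 0.0 {
        // Positive values grow as the bit pattern grows.
        f64::from_bits(bits + 1)
    } else {
        // Negative values move toward zero as the bit pattern shrinks.
        f64::from_bits(bits - 1)
    }
}

/// Largest f64 strictly less than `x`, by symmetry.
fn next_below(x: f64) -> f64 {
    -next_above(-x)
}

fn main() {
    println!("{:e}", next_above(1.0) - 1.0); // gap above 1.0 is f64::EPSILON
    println!("{:e}", 1.0 - next_below(1.0)); // gap below 1.0 is half of that
}
```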

## How to perform round to even with floating point numbers

Just to make sure we’re on the same page, G is the most significant bit of the three, R comes next and S can be thought of as the least significant bit because its value partially represents the even less significant bits that have been truncated in the calculations. These three bits are only used …
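As an illustration of how the three bits drive the decision, here is a small sketch of round-half-to-even on an integer significand. It assumes the usual layout where G, R and S are the three lowest bits of the input, directly below the bits that are kept; the function name `round_nearest_even_grs` is made up for this example.

```rust
/// Rounds a significand that carries three extra low bits (G, R, S)
/// to nearest, ties to even, and returns the kept bits.
fn round_nearest_even_grs(value_with_grs: u64) -> u64 {
    let kept = value_with_grs >> 3;
    let g = (value_with_grs >> 2) & 1 == 1; // guard: worth half a unit of the kept LSB
    let r = (value_with_grs >> 1) & 1 == 1; // round
    let s = value_with_grs & 1 == 1; // sticky: OR of everything truncated below R
    let lsb_is_odd = kept & 1 == 1;

    // Below half (G = 0): round down.
    // Above half (G = 1 and R or S set): round up.
    // Exactly half (G = 1, R = S = 0): round up only if that makes the result even.
    if g && (r || s || lsb_is_odd) {
        kept + 1
    } else {
        kept
    }
}

fn main() {
    println!("{}", round_nearest_even_grs(0b1011_100)); // tie, 11 is odd  -> 12
    println!("{}", round_nearest_even_grs(0b1010_100)); // tie, 10 is even -> 10
    println!("{}", round_nearest_even_grs(0b1010_101)); // above half      -> 11
    println!("{}", round_nearest_even_grs(0b1010_011)); // below half      -> 10
}
```

A carry out of the kept bits (for example all ones rounding up) would still need a renormalization step in a real floating-point routine; that is left out here.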

## Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Yes it’s possible. But as of AVX2, it’s unlikely to be better than the scalar approaches with MULX/ADCX/ADOX. There’s virtually an unlimited number of variations of this approach for different input/output domains. I’ll only cover 3 of them, but they are easy to generalize once you know how they work. Disclaimers: All solutions here assume …
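The building block that makes this possible is that a fused multiply-add can recover the rounding error of a double-precision product exactly. The scalar sketch below shows only that building block, using Rust's `f64::mul_add`; it is not necessarily one of the three variants the answer goes on to cover, and the function name `exact_product` is made up for this illustration.

```rust
/// Exact product of two integers below 2^52, computed with one
/// floating-point multiply and one fused multiply-add.
fn exact_product(a: u64, b: u64) -> u128 {
    assert!(a < (1u64 << 52) && b < (1u64 << 52), "inputs must fit in 52 bits");
    let (x, y) = (a as f64, b as f64); // exact: both fit in the 53-bit significand

    // hi is the product rounded to 53 significant bits.
    let hi = x * y;
    // The FMA computes x*y - hi with a single rounding; the true error is
    // representable, so lo is exactly the part that hi rounded away.
    let lo = x.mul_add(y, -hi);

    // hi and lo are integer-valued, so the casts are exact; lo may be negative.
    (hi as i128 + lo as i128) as u128
}

fn main() {
    let a = (1u64 << 52) - 1;
    let b = 987_654_321_098_765u64;
    println!("{}", exact_product(a, b));
    println!("{}", a as u128 * b as u128); // reference result from integer math
}
```

A vectorized version would apply the same two operations per lane and then split `hi` and `lo` back into integer words, which is where the different input/output domains come in.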

## How can I use a HashMap with f64 as key in Rust?

Presented with no comment beyond "read all the other comments and answers to understand why you probably don’t want to do this": use std::{collections::HashMap, hash}; #[derive(Debug, Copy, Clone)] struct DontUseThisUnlessYouUnderstandTheDangers(f64); impl DontUseThisUnlessYouUnderstandTheDangers { fn key(&self) -> u64 { self.0.to_bits() } } impl hash::Hash for DontUseThisUnlessYouUnderstandTheDangers { fn hash<H>(&self, state: &mut H) where H: hash::Hasher, { …
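Since the excerpt is cut off, here is a separate, self-contained sketch of the same general idea: hashing and comparing an `f64` by its bit pattern. The wrapper name `BitsKey` is made up for this illustration and is not the answer's type; the caveats in the comments are the "dangers" that bitwise equality brings.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper that treats an f64 as its raw 64-bit pattern.
#[derive(Debug, Clone, Copy)]
struct BitsKey(f64);

impl PartialEq for BitsKey {
    fn eq(&self, other: &Self) -> bool {
        // Bitwise equality: reflexive even for NaN, but 0.0 != -0.0.
        self.0.to_bits() == other.0.to_bits()
    }
}

// Sound because bitwise equality is a true equivalence relation.
impl Eq for BitsKey {}

impl Hash for BitsKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Must agree with `eq`, so hash the same bits that `eq` compares.
        self.0.to_bits().hash(state);
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(BitsKey(1.5), "one and a half");
    map.insert(BitsKey(0.0), "positive zero");

    println!("{:?}", map.get(&BitsKey(1.5)));
    // -0.0 == 0.0 as floats, but the bit patterns differ, so this misses.
    println!("{:?}", map.get(&BitsKey(-0.0)));
    // Values that print the same can still miss if they differ in the last bit.
    println!("{:?}", map.get(&BitsKey(0.1 + 0.2 + 1.2)));
}
```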

## What’s the relative speed of floating point add vs. floating point multiply

It also depends on instruction mix. Your processor will have several computation units standing by at any time, and you’ll get maximum throughput if all of them are filled all the time. So, executing a loop of muls is just as fast as executing a loop of adds – but the same doesn’t hold if …

## How do I get the minimum or maximum value of an iterator containing floating point numbers?

Floats have their own min and max methods that handle NaN consistently, so you can fold over the iterator: `use std::f64; fn main() { let x = [2.0, 1.0, -10.0, 5.0, f64::NAN]; let min = x.iter().fold(f64::INFINITY, |a, &b| a.min(b)); println!("{}", min); }` Prints -10. If you want different NaN handling, you can use PartialOrd::partial_cmp. For …
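The two styles of NaN handling can be put side by side. The sketch below wraps the fold from the answer in small helpers and adds one possible `partial_cmp`-based variant that gives up when it meets a NaN; the helper names are made up, and the strict variant is an assumption about what "different NaN handling" might look like, not the answer's own code.

```rust
use std::cmp::Ordering;

/// Minimum that skips NaN values (f64::min returns the non-NaN operand).
fn min_ignoring_nan(xs: &[f64]) -> f64 {
    xs.iter().fold(f64::INFINITY, |a, &b| a.min(b))
}

/// Maximum that skips NaN values.
fn max_ignoring_nan(xs: &[f64]) -> f64 {
    xs.iter().fold(f64::NEG_INFINITY, |a, &b| a.max(b))
}

/// Minimum that returns None for an empty slice or if any value is NaN.
fn min_strict(xs: &[f64]) -> Option<f64> {
    let mut it = xs.iter().copied();
    let first = it.next()?;
    if first.is_nan() {
        return None;
    }
    it.try_fold(first, |acc, x| {
        // partial_cmp yields None when either side is NaN; `?` propagates it.
        let ord = acc.partial_cmp(&x)?;
        Some(if ord == Ordering::Greater { x } else { acc })
    })
}

fn main() {
    let x = [2.0, 1.0, -10.0, 5.0, f64::NAN];
    println!("{}", min_ignoring_nan(&x)); // -10
    println!("{}", max_ignoring_nan(&x)); // 5
    println!("{:?}", min_strict(&x)); // None
    println!("{:?}", min_strict(&x[..4])); // Some(-10.0)
}
```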

## How do I use floating point number literals when using generic types?

Use the FromPrimitive trait: `use num_traits::{cast::FromPrimitive, float::Float}; fn scale_float<T: Float + FromPrimitive>(x: T) -> T { x * T::from_f64(0.54).unwrap() }` Or the standard library From / Into traits: `fn scale_float<T>(x: T) -> T where T: Float, f64: Into<T> { x * 0.54.into() }` See also: How do I use number literals with the Integer trait …