How to efficiently perform double/int64 conversions with SSE/AVX?

There’s no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned. (Also support for conversion to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.

If you only have AVX2 or less, you’ll need tricks like below for packed-conversion. (For scalar, x86-64 has scalar int64_t <-> double or float from SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)

If you’re willing to cut corners, double <-> int64 conversions can be done in only two instructions:

If you don’t care about infinity or NaN.
For double <-> int64_t, you only care about values in the range [-2^51, 2^51].
For double <-> uint64_t, you only care about values in the range [0, 2^52).

double -> uint64_t

//  Only works for inputs in the range: [0, 2^52)
__m128i double_to_uint64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

double -> int64_t

//  Only works for inputs in the range: [-2^51, 2^51]
__m128i double_to_int64(__m128d x){
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}

uint64_t -> double

//  Only works for inputs in the range: [0, 2^52)
__m128d uint64_to_double(__m128i x){
    x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));
}

int64_t -> double

//  Only works for inputs in the range: [-2^51, 2^51]
__m128d int64_to_double(__m128i x){
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));
}

Rounding Behavior:

For the double -> uint64_t conversion, rounding works correctly following the current rounding mode. (which is usually round-to-even)
For the double -> int64_t conversion, rounding will follow the current rounding mode for all modes except truncation. If the current rounding mode is truncation (round towards zero), it will actually round towards negative infinity.

How does it work?

Despite this trick being only 2 instructions, it’s not entirely self-explanatory.

The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the “binary place” just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.

To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the “normalized” range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

Now all that’s left is to remove the upper 12 bits. This is easily done by masking it out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has more throughput.

Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.

The signed integer conversions are slightly trickier since you need to deal with the 2’s complement sign-extension. I’ll leave those as an exercise for the reader.

Full Range int64 -> double:

After many years, I finally had a need for this.

5 instructions for uint64_t -> double
6 instructions for int64_t -> double

uint64_t -> double

__m128d uint64_to_double_full(__m128i x){
    __m128i xH = _mm_srli_epi64(x, 32);
    xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

int64_t -> double

__m128d int64_to_double_full(__m128i x){
    __m128i xH = _mm_srai_epi32(x, 16);
    xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
    xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
    __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
    return _mm_add_pd(f, _mm_castsi128_pd(xL));
}

These work for the entire 64-bit range and are correctly rounded to the current rounding behavior.

These are similar wim’s answer below – but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.

More Related Contents:

Leave a Comment Cancel reply