Fastest way to set __m256 value to all ONE bits

See also Set all bits in CPU register to 1 efficiently, which covers AVX, AVX2, and AVX512 zmm and k (mask) registers.


You obviously didn’t even look at the asm output, which is trivial to do:

    #include <immintrin.h>
    __m256i all_ones(void) { return _mm256_set1_epi64x(-1); }

compiles, with GCC and clang and any -march that includes AVX2, to:

    vpcmpeqd        ymm0, ymm0, ymm0
    ret

To get a __m256 (not __m256i) you can just cast the result:

    __m256 nans = _mm256_castsi256_ps( _mm256_set1_epi32(-1) );

Without AVX2, a possible option is vcmptrueps dst, ymm0, ymm0, preferably with a cold register for the input to mitigate the false dependency.
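From intrinsics, a minimal sketch of that (the all_ones_avx1 name and the zeroed input are my choices for illustration; _CMP_TRUE_UQ is the always-true predicate):

    #include <immintrin.h>

    // All-ones with only AVX1: compare a register with itself using an
    // always-true predicate; this can compile to vcmptrueps ymm, ymm, ymm.
    __m256 all_ones_avx1(void) {
        __m256 a = _mm256_setzero_ps();            // xor-zeroing breaks any dependency on the old value
        return _mm256_cmp_ps(a, a, _CMP_TRUE_UQ);  // always-true compare => all bits set
    }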

Recent clang (5.0 and later) xor-zeroes a vector and then does vcmpps with a TRUE predicate if AVX2 isn't available. Older clang makes a 128-bit all-ones with vpcmpeqd xmm and uses vinsertf128. GCC loads from memory, even modern GCC 10.1 with -march=sandybridge.


As described by the vector section of Agner Fog’s optimizing assembly guide, generating constants on the fly this way is cheap. It still takes a vector execution unit to generate the all-ones (unlike _mm_setzero), but it’s better than any possible two-instruction sequence, and usually better than a load. See also the tag wiki.

Compilers don’t like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift. Even if you try, by writing __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1), compilers typically do constant-propagation and put the vector in memory. This lets them fold it into a memory operand when used later in cases where there’s no loop to hoist the constant out of.
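As a compilable sketch of that attempt (the abs_ps wrapper is mine for illustration):

    #include <immintrin.h>

    // Asking for all-ones shifted right by 1 (0x7FFFFFFF per element).  Compilers
    // typically constant-propagate this into a 16-byte constant in memory instead
    // of emitting pcmpeqd + psrld.
    __m128 abs_ps(__m128 x) {
        __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1);
        return _mm_and_ps(x, _mm_castsi128_ps(float_signbit_mask));  // clear the sign bit of each float
    }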


And I can’t seem to find a simple bitwise NOT operation in AVX?

You do that by XORing with all-ones, using vxorps (_mm256_xor_ps). Unfortunately SSE/AVX don't provide a way to do a NOT without a vector constant.
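For example, a minimal sketch (the not_ps name is mine):

    #include <immintrin.h>

    // Bitwise NOT of a __m256: XOR with an all-ones vector.
    __m256 not_ps(__m256 x) {
        __m256 all_ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));  // vpcmpeqd with AVX2
        return _mm256_xor_ps(x, all_ones);                             // vxorps
    }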


FP vs Integer instructions and bypass delay

Intel CPUs (at least Skylake) have a weird effect where the extra bypass latency between SIMD-integer and SIMD-FP still happens long after the uop producing the register has executed. For example, vmulps ymm1, ymm2, ymm0 could have an extra cycle of latency for the ymm2 -> ymm1 critical path if ymm0 was produced by vpcmpeqd. And this lasts until the next context switch restores FP state, if you don't otherwise overwrite ymm0.

This is not a problem for bitwise instructions like vxorps (even though the mnemonic has ps, it doesn’t have bypass delay from FP or vec-int domains on Skylake, IIRC).

So normally it's safe to create a set1(-1) constant with an integer instruction, because that bit-pattern is a NaN and you wouldn't normally use it with FP math instructions like mul or add, only as a mask for bitwise operations.
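As a sketch of that kind of usage (the all_in_range helper is mine for illustration and assumes n is a multiple of 8), the all-ones value only ever feeds bitwise instructions:

    #include <immintrin.h>
    #include <stddef.h>

    // All-ones as the identity for AND-reducing compare masks; it only meets
    // vandps / vmovmskps, never FP math like vmulps or vaddps.
    int all_in_range(const float *p, size_t n, __m256 lo, __m256 hi) {
        __m256 acc = _mm256_castsi256_ps(_mm256_set1_epi32(-1));   // start "all true"
        for (size_t i = 0; i < n; i += 8) {
            __m256 v  = _mm256_loadu_ps(p + i);
            __m256 ok = _mm256_and_ps(_mm256_cmp_ps(v, lo, _CMP_GE_OQ),
                                      _mm256_cmp_ps(v, hi, _CMP_LE_OQ));
            acc = _mm256_and_ps(acc, ok);
        }
        return _mm256_movemask_ps(acc) == 0xFF;   // every lane still all-ones?
    }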
