Fastest way to unpack 32 bits to a 32 byte SIMD vector
To “broadcast” the 32 bits of a 32-bit integer x to the 32 bytes of a 256-bit YMM register z, or to the 16 bytes each of two 128-bit XMM registers z_low and z_high, you can do the following. With AVX2: __m256i y = _mm256_set1_epi32(x); __m256i z = _mm256_shuffle_epi8(y, mask1); z = _mm256_and_si256(z, mask2); Without AVX2 it’s best to do …
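The excerpt uses mask1 and mask2 without defining them. The following is a minimal sketch of the AVX2 sequence in C, not necessarily the author's exact code. It assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports). The mask values are an assumption: mask1 is taken to be a shuffle control that copies source byte i/8 into output byte i, and mask2 a per-byte bit selector 0x01, 0x02, ..., 0x80. The function names and the scalar reference are hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference (hypothetical helper): out[i] is nonzero iff bit i of x
   is set. It holds that bit's value within its source byte (1, 2, ..., 128). */
static void unpack_bits_scalar(uint32_t x, uint8_t out[32]) {
    for (int i = 0; i < 32; i++) {
        uint8_t src = (uint8_t)(x >> (8 * (i / 8)));   /* source byte i/8 */
        out[i] = (uint8_t)(src & (1u << (i % 8)));     /* keep bit i%8    */
    }
}

/* AVX2 path: the three intrinsics named in the text, with assumed masks. */
__attribute__((target("avx2")))
static void unpack_bits_avx2(uint32_t x, uint8_t out[32]) {
    /* Assumed mask1: output bytes 0-7 read source byte 0, bytes 8-15 read
       byte 1, bytes 16-23 read byte 2, bytes 24-31 read byte 3. The shuffle
       works within each 128-bit lane, which is fine here because every lane
       of y holds the same four bytes of x. */
    const __m256i mask1 = _mm256_set_epi64x(0x0303030303030303LL,
                                            0x0202020202020202LL,
                                            0x0101010101010101LL,
                                            0x0000000000000000LL);
    /* Assumed mask2: byte k of every 8-byte group selects bit k. */
    const __m256i mask2 =
        _mm256_set1_epi64x((long long)0x8040201008040201ULL);

    __m256i y = _mm256_set1_epi32((int)x);      /* x repeated 8 times      */
    __m256i z = _mm256_shuffle_epi8(y, mask1);  /* spread bytes of x       */
    z = _mm256_and_si256(z, mask2);             /* isolate one bit per byte */
    _mm256_storeu_si256((__m256i *)out, z);
}

/* Dispatch: use AVX2 when the CPU reports it, else the scalar reference. */
static void unpack_bits(uint32_t x, uint8_t out[32]) {
    if (__builtin_cpu_supports("avx2"))
        unpack_bits_avx2(x, out);
    else
        unpack_bits_scalar(x, out);
}

/* Self-check: the dispatched result must equal the scalar reference. */
static void check_unpack(uint32_t x) {
    uint8_t want[32], got[32];
    unpack_bits_scalar(x, want);
    unpack_bits(x, got);
    assert(memcmp(want, got, 32) == 0);
}
```

With these masks each output byte is either zero or the value of its bit (1, 2, 4, ..., 128), not 0xFF. If an all-ones byte per set bit is wanted, one option is to compare the result against mask2 with _mm256_cmpeq_epi8.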