avx2 - w3toppers.com

AVX2: Computing dot product of 512 float arrays

_mm256_dp_ps is only useful for dot-products of 2 to 4 elements; for longer vectors use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower. GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics … Read more
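A minimal sketch of the "vertical SIMD in a loop, reduce once at the end" approach described above. The function name dot_avx2, the choice of four accumulators, and the requirement that n be a multiple of 32 are assumptions made for illustration. The target attribute is a GCC/clang feature used here as an alternative to command-line options such as -mavx2 -mfma.

```c
#include <immintrin.h>
#include <stddef.h>

/* Dot product of two float arrays: vertical FMAs in the loop,
   one horizontal reduction at the end.
   Assumption: n is a multiple of 32 (true for 512-element arrays). */
__attribute__((target("avx2,fma")))
float dot_avx2(const float *a, const float *b, size_t n)
{
    /* Several independent accumulators help hide FMA latency. */
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                               _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                               _mm256_loadu_ps(b + i + 24), acc3);
    }

    /* Combine the accumulators vertically first. */
    __m256 sum = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));

    /* Horizontal reduction, done only once: 8 -> 4 -> 2 -> 1 floats. */
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* add element 1 to element 0 */
    return _mm_cvtss_f32(s);
}
```

Because the additions happen in a different order than in a scalar loop, the result can differ from a scalar sum in the last bits for general inputs.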

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

For good throughput with multiple source vectors, it’s a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn’t necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that … Read more
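As a sketch of the 4:1 packing idea, the following converts 32 floats to 32 saturated int8 values using two levels of 2-input packs. The function name float32x32_to_int8 is hypothetical, and the final lane-crossing permute is one possible way to restore source order after the in-lane packs; it is not necessarily the exact sequence the full answer recommends.

```c
#include <immintrin.h>
#include <stdint.h>

/* Convert 32 floats to 32 int8 values (signed saturation), in source order.
   Rounding follows the current MXCSR mode (round-to-nearest-even by default). */
__attribute__((target("avx2")))
void float32x32_to_int8(const float *src, int8_t *dst)
{
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(src));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 24));

    /* Each pack takes two inputs and operates within 128-bit lanes. */
    __m256i ab   = _mm256_packs_epi32(a, b);    /* 16 x int16 */
    __m256i cd   = _mm256_packs_epi32(c, d);    /* 16 x int16 */
    __m256i abcd = _mm256_packs_epi16(ab, cd);  /* 32 x int8  */

    /* After the in-lane packs the 4-byte groups are ordered
       a0-3 b0-3 c0-3 d0-3 | a4-7 b4-7 c4-7 d4-7.
       One dword permute puts them back in source order. */
    const __m256i fix = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    abcd = _mm256_permutevar8x32_epi32(abcd, fix);

    _mm256_storeu_si256((__m256i *)dst, abcd);
}
```

Four input vectors produce one full output vector, so only a single lane-crossing shuffle is needed per 32 results.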

Fastest way to set __m256 value to all ONE bits

See also Set all bits in CPU register to 1 efficiently which covers AVX, AVX2, and AVX512 zmm and k (mask) registers. You obviously didn’t even look at the asm output, which is trivial to do: #include <immintrin.h> __m256i all_ones(void) { return _mm256_set1_epi64x(-1); } compiles to with GCC and clang with any -march that includes … Read more

Find the first instance of a character using simd

You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that’s the optimal way to implement this for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and … Read more
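The compare-then-movemask pattern mentioned above might look like the sketch below. The function name find_first_avx2, the scalar tail loop, and the use of the GCC/clang builtin __builtin_ctz are assumptions for illustration; the excerpt only names the two intrinsics.

```c
#include <immintrin.h>
#include <stddef.h>

/* Return the index of the first occurrence of c in buf[0..len),
   or len if c does not occur. */
__attribute__((target("avx2")))
size_t find_first_avx2(const char *buf, size_t len, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    size_t i = 0;

    /* Process 32 bytes per iteration. */
    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i eq    = _mm256_cmpeq_epi8(chunk, needle);
        /* One mask bit per byte; bit k is set if byte k matched. */
        unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
        if (mask != 0)
            return i + (size_t)__builtin_ctz(mask); /* lowest set bit = first match */
    }

    /* Scalar tail for the remaining 0..31 bytes. */
    for (; i < len; i++)
        if (buf[i] == c)
            return i;
    return len;
}
```

A single movemask covers both 128-bit lanes, so no per-lane unpacking is required before the bit scan.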

Where is VPERMB in AVX2?

I’m 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn’t exist is that the implementation cost must outweigh the significant benefit. Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field … Read more

Load address calculation when using AVX2 gather instructions

Gather instructions do not have any alignment requirements, so it would be too restrictive not to allow byte addressing. Another reason is consistency. With SIB addressing we obviously have a byte address: MOV eax, [rcx + rdx * 2]. Since VPGATHERDD is just a vectorized variant of this MOV instruction, we should not expect anything different … Read more
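To illustrate the address calculation, here is a small sketch using _mm256_i32gather_epi32, where each element is loaded from base + index * scale bytes. The function names gather8_i32 and gather8_i32_bytes are hypothetical. The first passes element indices with scale 4, the second passes byte offsets with scale 1, and both read the same data when the offsets are the indices multiplied by 4.

```c
#include <immintrin.h>
#include <stdint.h>

/* Gather 8 int32 values: out[k] = base[idx[k]].
   scale = 4, so each address is (char *)base + idx[k] * 4. */
__attribute__((target("avx2")))
void gather8_i32(const int32_t *base, const int32_t *idx, int32_t *out)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    __m256i v = _mm256_i32gather_epi32((const int *)base, vindex, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Same gather, but the indices are byte offsets.
   scale = 1, so each address is (char *)base + byte_off[k]. */
__attribute__((target("avx2")))
void gather8_i32_bytes(const int32_t *base, const int32_t *byte_off,
                       int32_t *out)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)byte_off);
    __m256i v = _mm256_i32gather_epi32((const int *)base, vindex, 1);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

With scale 1 the offsets need not be multiples of 4, which is the byte-addressing flexibility the excerpt refers to.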

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Yes it’s possible. But as of AVX2, it’s unlikely to be better than the scalar approaches with MULX/ADCX/ADOX. There’s virtually an unlimited number of variations of this approach for different input/output domains. I’ll only cover 3 of them, but they are easy to generalize once you know how they work. Disclaimers: All solutions here assume … Read more

In what situation would the AVX2 gather instructions be faster than individually loading the data?

Newer microarchitectures have shifted the odds towards gather instructions. On an Intel Xeon Gold 6138 CPU @ 2.00 GHz with Skylake microarchitecture, we get for your benchmark: 9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 … Read more

Get sum of values stored in __m256d with SSE/AVX

It appears that you’re doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all. For a dot-product of an … Read more
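For cases where a horizontal sum of a __m256d really is needed, one common sequence is sketched below. The function name hsum4_pd and the pointer-based interface are assumptions for illustration: narrow to 128 bits first, then add the two remaining halves.

```c
#include <immintrin.h>

/* Sum the four doubles at p[0..3] using one 256-bit load
   and a narrowing reduction: 4 -> 2 -> 1 elements. */
__attribute__((target("avx")))
double hsum4_pd(const double *p)
{
    __m256d v  = _mm256_loadu_pd(p);
    __m128d lo = _mm256_castpd256_pd128(v);    /* elements 0,1 (no instruction) */
    __m128d hi = _mm256_extractf128_pd(v, 1);  /* elements 2,3 */
    __m128d s  = _mm_add_pd(lo, hi);           /* {e0+e2, e1+e3} */
    __m128d high64 = _mm_unpackhi_pd(s, s);    /* {e1+e3, e1+e3} */
    return _mm_cvtsd_f64(_mm_add_sd(s, high64));
}
```

When this runs once per output element, restructuring the loops so that four results accumulate in one vector avoids the shuffles entirely, as the excerpt suggests.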

Fastest way to unpack 32 bits to a 32 byte SIMD vector

To “broadcast” the 32 bits of a 32-bit integer x to the 32 bytes of a 256-bit YMM register z, or to the 16 bytes each of two 128-bit XMM registers z_low and z_high, you can do the following. With AVX2: __m256i y = _mm256_set1_epi32(x); __m256i z = _mm256_shuffle_epi8(y,mask1); z = _mm256_and_si256(z,mask2); Without AVX2 it’s best to do … Read more
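The excerpt does not show how mask1 and mask2 are defined, so the sketch below fills them in with one plausible choice. mask1 replicates source byte k into output bytes 8k..8k+7, and mask2 selects a different bit in each byte of an 8-byte group. The function name bits_to_bytes and the final compare that turns nonzero bytes into 0xFF are also assumptions, not part of the quoted snippet.

```c
#include <immintrin.h>
#include <stdint.h>

/* Expand the 32 bits of x into 32 bytes:
   out[j] = 0xFF if bit j of x is set, else 0x00. */
__attribute__((target("avx2")))
void bits_to_bytes(uint32_t x, uint8_t out[32])
{
    /* Assumed mask1: byte selectors for the in-lane shuffle. Every dword of y
       holds x, so selectors 0..3 pick bytes 0..3 of x in either lane. */
    const __m256i mask1 = _mm256_setr_epi8(
        0, 0, 0, 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2,  3, 3, 3, 3, 3, 3, 3, 3);
    /* Assumed mask2: bytes 0x01, 0x02, ..., 0x80 repeated in each 8-byte group. */
    const __m256i mask2 =
        _mm256_set1_epi64x((long long)0x8040201008040201ULL);

    __m256i y = _mm256_set1_epi32((int)x);
    __m256i z = _mm256_shuffle_epi8(y, mask1);
    z = _mm256_and_si256(z, mask2);   /* nonzero where the bit is set */
    z = _mm256_cmpeq_epi8(z, mask2);  /* normalize to 0xFF / 0x00 */
    _mm256_storeu_si256((__m256i *)out, z);
}
```

If a nonzero/zero byte is enough for the caller, the final compare can be dropped, leaving exactly the three operations quoted in the excerpt.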