How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

For good throughput with multiple source vectors, it’s a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn’t necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that … Read more
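As a rough illustration of the two-input pack approach mentioned above, here is a minimal sketch of a 4:1 narrowing of 32 `int32_t` values into one 256-bit vector of `int8_t`. The helper name `pack32_i32_to_i8` and its pointer-based signature are assumptions made for this example, not part of the original answer. The `target("avx2")` attribute is GCC/Clang-specific and is used only so the snippet compiles without extra flags. It still needs an AVX2-capable CPU to run.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: narrows src[0..31] (int32) to dst[0..31] (int8)
   with signed saturation, keeping the original element order. */
__attribute__((target("avx2")))
void pack32_i32_to_i8(const int32_t *src, int8_t *dst)
{
    __m256i a = _mm256_loadu_si256((const __m256i *)(src + 0));
    __m256i b = _mm256_loadu_si256((const __m256i *)(src + 8));
    __m256i c = _mm256_loadu_si256((const __m256i *)(src + 16));
    __m256i d = _mm256_loadu_si256((const __m256i *)(src + 24));

    /* The 256-bit packs work within each 128-bit lane:
       ab = a0-3 b0-3 | a4-7 b4-7   (int16 elements) */
    __m256i ab = _mm256_packs_epi32(a, b);
    __m256i cd = _mm256_packs_epi32(c, d);

    /* abcd = a0-3 b0-3 c0-3 d0-3 | a4-7 b4-7 c4-7 d4-7   (int8 elements) */
    __m256i abcd = _mm256_packs_epi16(ab, cd);

    /* One lane-crossing dword shuffle restores the original order. */
    const __m256i fix = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    __m256i ordered = _mm256_permutevar8x32_epi32(abcd, fix);

    _mm256_storeu_si256((__m256i *)dst, ordered);
}
```

Because both pack steps consume two inputs, four source vectors cost three pack instructions plus one shuffle and yield a single full-width result. Values outside the `int8_t` range saturate to -128 or 127.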

Where is VPERMB in AVX2?

I’m 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn’t exist is that the implementation cost must outweigh the significant benefit. Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field … Read more

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Yes, it’s possible, but as of AVX2 it’s unlikely to be better than the scalar approaches with MULX/ADCX/ADOX. There is a virtually unlimited number of variations of this approach for different input/output domains. I’ll only cover 3 of them, but they are easy to generalize once you know how they work. Disclaimers: All solutions here assume … Read more

In what situation would the AVX2 gather instructions be faster than individually loading the data?

Newer microarchitectures have shifted the odds towards gather instructions. On an Intel Xeon Gold 6138 CPU @ 2.00 GHz with Skylake microarchitecture, we get for your benchmark: 9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 … Read more
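For readers unfamiliar with the two alternatives being compared, the following is a minimal sketch of an 8-element gather written once with the AVX2 gather intrinsic and once with individual scalar loads. The function names and the 8-element shape are assumptions chosen for illustration. This is not the benchmark from the original question, and it makes no claim about which version is faster on a given CPU. The `target("avx2")` attribute is GCC/Clang-specific.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = table[idx[i]] for i in 0..7,
   using a single AVX2 gather instruction. */
__attribute__((target("avx2")))
void gather8_avx2(const int32_t *table, const int32_t *idx, int32_t *out)
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 4: the indices count 4-byte elements. */
    __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* The same operation with eight individual loads. */
void gather8_scalar(const int32_t *table, const int32_t *idx, int32_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = table[idx[i]];
}
```

Both functions produce the same result, so they can be swapped in a timing loop to check which one wins on a particular microarchitecture.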