Do 128bit cross lane operations in AVX512 give better performance?

Generally yes: in-lane shuffles still have lower latency on SKX (1 cycle vs. 3), but usually it’s not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w and a couple of other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles. Shuffle throughput very easily becomes …

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

No, a vpcmpeqb into a mask register does not trigger slow mode if you use a zmm register as one of the comparands, at least on SKX. This is also true of any other instruction (as far as I tested) which only reads the key 512-bit registers (the key registers being zmm0 – …

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

Related: if you’re looking for the nonexistent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics. vpsadbw as an hsum within qwords is much more efficient than shuffling. Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel’s reduce_add helper function. reduce_add doesn’t necessarily compile optimally anyway with AVX512. There is an int _mm512_reduce_add_epi32(__m512i) inline …
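
As a rough illustration of the AVX2 horizontal sum mentioned above, here is a minimal sketch. It is not the answer's own code: the body of hsum_8x32 is one common way to write it (narrow 256 → 128 bits, then fold within the 128-bit vector), and sum8_u32 with its scalar fallback is a hypothetical wrapper added so the example runs on CPUs without AVX2. It assumes x86 with GCC or Clang, since it relies on the target attribute and __builtin_cpu_supports.

```c
#include <immintrin.h>
#include <stdint.h>

/* Horizontal sum of eight packed 32-bit integers (wraps modulo 2^32). */
__attribute__((target("avx2")))
static inline uint32_t hsum_8x32(__m256i v)
{
    __m128i lo   = _mm256_castsi256_si128(v);        /* low 128 bits, free   */
    __m128i hi   = _mm256_extracti128_si256(v, 1);   /* high 128 bits        */
    __m128i sum4 = _mm_add_epi32(lo, hi);            /* 8 -> 4 partial sums  */
    __m128i hi64 = _mm_unpackhi_epi64(sum4, sum4);   /* bring down high qword */
    __m128i sum2 = _mm_add_epi32(sum4, hi64);        /* 4 -> 2 partial sums  */
    __m128i hi32 = _mm_shuffle_epi32(sum2, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum1 = _mm_add_epi32(sum2, hi32);        /* 2 -> 1               */
    return (uint32_t)_mm_cvtsi128_si32(sum1);
}

/* AVX2 path: load inside a target("avx2") function so callers need no -mavx2. */
__attribute__((target("avx2")))
static uint32_t sum8_u32_avx2(const uint32_t *p)
{
    return hsum_8x32(_mm256_loadu_si256((const __m256i *)p));
}

/* Hypothetical wrapper: picks the AVX2 path at run time, else a scalar loop. */
uint32_t sum8_u32(const uint32_t p[8])
{
    if (__builtin_cpu_supports("avx2"))
        return sum8_u32_avx2(p);
    uint32_t s = 0;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}
```

Narrowing to 128 bits first keeps every remaining shuffle in-lane, which is usually the cheaper choice for a reduction.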

Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?)

Extensions that introduce new architectural state require special OS support, because the OS has to save/restore more data on context switches. So from the OS’s perspective, there’s nothing extra it needs to do to let user-space code run SSSE3 instructions, if the OS supports SSE. SSE, AVX, and AVX512 are the extensions that introduced …
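
To make the "OS has to save/restore the state" point concrete, here is a minimal sketch of the usual run-time check: CPUID reports whether the CPU has the feature and whether the OS uses XSAVE, and XGETBV reads XCR0 to see which register state the OS has actually enabled. The function names avx_usable and avx512f_usable are hypothetical, and the code assumes x86 with GCC or Clang (for <cpuid.h> and inline asm).

```c
#include <cpuid.h>
#include <stdint.h>

/* Read XCR0, which records the register state the OS has enabled.
   Only call this after confirming CPUID.1:ECX.OSXSAVE, or xgetbv faults. */
static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* 1 if AVX is supported by the CPU and its YMM state is enabled by the OS. */
int avx_usable(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    const unsigned osxsave = 1u << 27, avx = 1u << 28;
    if ((ecx & (osxsave | avx)) != (osxsave | avx))
        return 0;
    return (read_xcr0() & 0x6) == 0x6;       /* XMM (bit 1) and YMM (bit 2) */
}

/* 1 if AVX-512F is supported and opmask/ZMM state is enabled by the OS. */
int avx512f_usable(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!avx_usable())
        return 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ebx & (1u << 16)))                  /* CPUID.7.0:EBX.AVX512F */
        return 0;
    return (read_xcr0() & 0xE6) == 0xE6;     /* adds opmask, ZMM_Hi256, Hi16_ZMM */
}
```

Checking only the CPUID feature bit is not enough: on an OS that does not enable the state in XCR0, the instructions fault even though the CPU supports them.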

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define __SSE__, __SSE2__, __SSE3__, __AVX__, __AVX2__, etc., according to whatever command-line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this:
$ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define …
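
A minimal sketch of how those predefined macros are typically used in source code, assuming a gcc-compatible compiler. The function name compile_time_simd_level is hypothetical; the macros themselves are the ones the compiler defines for the selected -m options.

```c
/* Report the widest SIMD extension enabled at compile time.
   This reflects the compiler flags (e.g. -mavx2), not the CPU the
   binary eventually runs on. */
const char *compile_time_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

Building the same file with -mavx2 and then with no -m option should change the string it returns, which makes it an easy way to confirm what a given build configuration enables.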

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

AVX-2: @HadiBreis’ comment links to an article on fast population-count with SSSE3, by Wojciech Muła; the article links to this GitHub repository; and the repository has the following AVX-2 implementation. It’s based on a vectorized lookup instruction, and using a 16-value lookup table for the bit counts of nibbles.
#include <immintrin.h>
#include <x86intrin.h>
…
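
For readers who want the idea without the full listing, here is a simplified sketch of the nibble-lookup technique described above. It is not the repository's implementation: vpshufb looks up the bit count of each 4-bit nibble in a 16-entry table, and vpsadbw against zero sums the per-byte counts into 64-bit lanes on every iteration (a tuned version would accumulate bytes for several iterations first). The names popcount_bytes and popcount_bytes_avx2 are hypothetical, and the code assumes x86 with GCC or Clang.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

__attribute__((target("avx2")))
static uint64_t popcount_bytes_avx2(const uint8_t *data, size_t n)
{
    /* Bit counts of the nibbles 0..15, repeated in both 128-bit lanes
       because vpshufb looks up within each lane separately. */
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    const __m256i zero = _mm256_setzero_si256();
    __m256i total = zero;                     /* four 64-bit running sums */
    size_t i = 0;

    for (; i + 32 <= n; i += 32) {
        __m256i v   = _mm256_loadu_si256((const __m256i *)(data + i));
        __m256i lo  = _mm256_and_si256(v, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        /* Sum each group of 8 byte counts into one 64-bit element. */
        total = _mm256_add_epi64(total, _mm256_sad_epu8(cnt, zero));
    }

    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, total);
    uint64_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++)                        /* leftover tail bytes */
        sum += (uint64_t)__builtin_popcount(data[i]);
    return sum;
}

/* Hypothetical wrapper: AVX2 path when available, scalar loop otherwise. */
uint64_t popcount_bytes(const uint8_t *data, size_t n)
{
    if (__builtin_cpu_supports("avx2"))
        return popcount_bytes_avx2(data, n);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint64_t)__builtin_popcount(data[i]);
    return sum;
}
```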