avx512 - w3toppers.com

Do 128bit cross lane operations in AVX512 give better performance?

Generally yes: in-lane shuffles still have lower latency on SKX (1 cycle vs. 3), but it’s usually not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w and a couple of other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles. Shuffle throughput very easily becomes … Read more

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

No, a vpcmpeqb into a mask register does not trigger slow mode if you use a zmm register as one of the comparands, at least on SKX. This is also true of any other instruction (as far as I tested) which only reads the key 512-bit registers (the key registers being zmm0 – … Read more

How to transpose a 16×16 matrix using SIMD instructions?

For two-operand SIMD instructions, you can show that the number of operations necessary to transpose an n×n matrix is n*log_2(n), whereas using scalar operations it’s O(n^2). In fact, later I’ll show that the number of read and write operations using the scalar registers is 2*n*(n-1). Below is a table showing the number of … Read more
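As a rough illustration of where a scalar count like 2*n*(n-1) can come from, here is a minimal sketch in C. It assumes the count refers to an in-place transpose that swaps each off-diagonal pair with two loads and two stores; the excerpt does not confirm that this is how the full answer counts. The function name transpose_scalar_count is made up for this example.

```c
#include <stddef.h>

/* Transpose an n x n row-major matrix in place using scalar loads/stores.
   Returns the number of element reads plus element writes performed.
   Assumption: each off-diagonal pair costs 2 reads + 2 writes. */
size_t transpose_scalar_count(float *a, size_t n) {
    size_t ops = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            float upper = a[i * n + j];  /* read element above the diagonal */
            float lower = a[j * n + i];  /* read its mirror below the diagonal */
            a[i * n + j] = lower;        /* write the swapped values back */
            a[j * n + i] = upper;
            ops += 4;
        }
    }
    return ops;
}
```

There are n*(n-1)/2 pairs above the diagonal and each swap touches memory four times, which gives 2*n*(n-1); for n = 16 that is 480 scalar memory operations.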

What is the penalty of mixing EVEX and VEX encoded scheme?

There is no penalty for mixing any of VEX 128 / 256 or EVEX 128 / 256 / 512 on any current CPUs, and no reason to expect any penalty on future CPUs. All VEX and EVEX coded instructions are defined to zero the high bytes of the destination vector register, out to whatever the … Read more

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

Related: if you’re looking for the nonexistent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics. vpsadbw as an hsum within qwords is much more efficient than shuffling. Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel’s reduce_add helper function. reduce_add doesn’t necessarily compile optimally anyway with AVX512. There is an int _mm512_reduce_add_epi32(__m512i) inline … Read more
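To make the idea of a horizontal sum of eight 32-bit integers concrete, here is a hedged sketch. It is not the hsum_8x32 from the full answer (the excerpt cuts off before that code); the name hsum_u32x8 and the array-based interface are assumptions chosen so that the example also builds without -mavx2, in which case it falls back to a scalar loop.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Sum eight packed 32-bit integers (wrapping on overflow). */
uint32_t hsum_u32x8(const uint32_t v[8]) {
#if defined(__AVX2__)
    __m256i x  = _mm256_loadu_si256((const __m256i *)v);
    __m128i lo = _mm256_castsi256_si128(x);        /* elements 0..3 */
    __m128i hi = _mm256_extracti128_si256(x, 1);   /* elements 4..7 */
    __m128i s  = _mm_add_epi32(lo, hi);            /* 4 partial sums */
    s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s)); /* 2 partial sums */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(s);         /* total in element 0 */
#else
    uint32_t total = 0;                            /* portable fallback */
    for (int i = 0; i < 8; i++) total += v[i];
    return total;
#endif
}
```

The vector path narrows 256 → 128 → 64 → 32 bits, adding the upper half to the lower half at each step, so only the first step needs a lane-crossing extract.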

Per-element atomicity of vector load/store and gather/scatter?

Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?)

Extensions that introduce new architectural state require special OS support, because the OS has to save/restore more data on context switches. So from the OS’s perspective, there’s nothing extra it needs to do to let user-space code run SSSE3 instructions, if the OS supports SSE. SSE, AVX, and AVX512 are the extensions that introduced … Read more
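For a runtime check, one option on gcc-compatible compilers is the __builtin_cpu_supports builtin. The sketch below is an illustration, not the method from the full answer. As far as I know, the runtime support behind these builtins also accounts for whether the OS has enabled the extended register state, but treat that as an assumption and verify it for your toolchain. The function names are made up for this example, and on non-x86 or non-GNU compilers the functions simply report 0.

```c
/* Returns 1 if the named feature is reported usable at run time, else 0.
   Assumption: GCC/clang builtins on x86; other targets report 0. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_CPU_BUILTINS 1
#else
#define HAVE_CPU_BUILTINS 0
#endif

int runtime_has_avx(void) {
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();  /* harmless if already initialized */
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}

int runtime_has_avx2(void) {
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

int runtime_has_avx512f(void) {
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}
```

A dispatcher would typically test the widest extension first and fall back to narrower code paths when a check returns 0.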

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort #define __SSE__ 1 #define __SSE2__ 1 #define … Read more
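In source code, the same predefined macros can be tested with the preprocessor. The sketch below is a minimal example for gcc-compatible compilers; the function name compiled_simd_level is hypothetical, and it reports only what the compiler was told to target, not what the CPU running the binary supports.

```c
/* Report the widest SIMD extension enabled by the compiler switches.
   This is a compile-time property; it says nothing about the host CPU. */
const char *compiled_simd_level(void) {
#if defined(__AVX512F__)
    return "AVX512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

Building the same file with -mavx2 and then without it should change the returned string, which is a quick way to confirm which switches actually reached the compiler.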

How to emulate _mm256_loadu_epi32 with gcc or clang?

Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don’t have to write ugly casts. @chtz points out that you might still want to write a wrapper function yourself to get the void* prototype. But don’t call … Read more

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

AVX-2: @HadiBreis’ comment links to an article on fast population count with SSSE3 by Wojciech Muła; the article links to this GitHub repository, which has the following AVX-2 implementation. It’s based on a vectorized lookup instruction, using a 16-value lookup table for the bit counts of nibbles. #include <immintrin.h> #include <x86intrin.h> … Read more
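The nibble-lookup idea can be sketched as follows. This is not the implementation from the linked repository (the excerpt cuts off before that code); it is an independent illustration, and the names popcount_lut, popcount_scalar_lut and NIBBLE_BITS are made up here. When built with -mavx2 it uses vpshufb as a 16-entry table lookup on 32 bytes at a time; otherwise, and for any leftover tail bytes, it uses the same table one byte at a time.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Number of set bits in each possible 4-bit value. */
static const uint8_t NIBBLE_BITS[16] =
    {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

static uint64_t popcount_scalar_lut(const uint8_t *p, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += NIBBLE_BITS[p[i] & 0x0F] + NIBBLE_BITS[p[i] >> 4];
    return total;
}

uint64_t popcount_lut(const uint8_t *p, size_t n) {
    uint64_t total = 0;
    size_t i = 0;
#if defined(__AVX2__)
    /* vpshufb looks up within each 128-bit lane, so repeat the table. */
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;                       /* four 64-bit counters */
    for (; i + 32 <= n; i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(p + i));
        __m256i lo = _mm256_and_si256(v, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        /* vpsadbw against zero sums each group of 8 byte counts. */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    return total + popcount_scalar_lut(p + i, n - i);  /* tail bytes */
}
```

Accumulating through vpsadbw on every iteration keeps the sketch simple; a tuned version would likely accumulate byte counts over several iterations before widening.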