avx - w3toppers.com

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do: __m256d x1, x2; // calculate 4 two-element horizontal sums: // lower 64 bits contain x1[0] + x1[1] // next 64 bits contain x2[0] + x2[1] // next 64 bits contain x1[2] + … Read more
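The excerpt above is cut off, so the following is only a minimal sketch of one way to finish the idea, assuming GCC or Clang on x86-64. The function name `hsum2_pd` and the array-based interface are hypothetical and chosen for illustration; the `target` attribute is used so that no extra compiler flag is needed.

```c
#include <immintrin.h>

/* Hypothetical helper: out[0] = a[0]+a[1]+a[2]+a[3],
 *                      out[1] = b[0]+b[1]+b[2]+b[3]. */
__attribute__((target("avx")))
void hsum2_pd(const double a[4], const double b[4], double out[2])
{
    __m256d x1 = _mm256_loadu_pd(a);
    __m256d x2 = _mm256_loadu_pd(b);

    /* Four two-element sums, in order from the lowest 64 bits:
     * { x1[0]+x1[1], x2[0]+x2[1], x1[2]+x1[3], x2[2]+x2[3] } */
    __m256d pairs = _mm256_hadd_pd(x1, x2);

    /* Add the upper 128-bit half to the lower half. */
    __m128d high = _mm256_extractf128_pd(pairs, 1);
    __m128d low  = _mm256_castpd256_pd128(pairs);

    /* Lower element: full sum of x1. Upper element: full sum of x2. */
    _mm_storeu_pd(out, _mm_add_pd(low, high));
}
```

Summing two vectors at once uses both halves of the `_mm256_hadd_pd` result; with a single vector, half of that work would be wasted.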

Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first (least-significant element first) comments or variable names, I match that. Given the choice, I prefer MSE-first (most-significant element first) notation in comments, especially when designing something with shuffles, or when packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in … Read more
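To make the two notations concrete, here is a small sketch that uses only SSE2. The function names are hypothetical. `_mm_set_epi32` takes its arguments most-significant element first, while `_mm_setr_epi32` takes them in memory (least-significant first) order, so both calls below build the same vector.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: build a vector with MSE-first argument order. */
void build_mse_first(int32_t out[4])
{
    /* MSE-first comment style: [ 3 2 1 0 ]  (element 0 on the right) */
    __m128i v = _mm_set_epi32(3, 2, 1, 0);
    _mm_storeu_si128((__m128i *)out, v);
}

/* Hypothetical helper: build the same vector with LSE-first argument order. */
void build_lse_first(int32_t out[4])
{
    /* LSE-first comment style: { 0 1 2 3 }  (element 0 on the left) */
    __m128i v = _mm_setr_epi32(0, 1, 2, 3);
    _mm_storeu_si128((__m128i *)out, v);
}
```

After either call, `out[i]` holds `i`; only the way the vector is written down in source and comments differs.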

SIMD math libraries for SSE and AVX

I have implemented Vecmathlib (https://bitbucket.org/eschnett/vecmathlib/) as a generic library for two other projects (The Einstein Toolkit, and pocl, http://pocl.sourceforge.net/). Vecmathlib is open source and is written in C++.

Get sum of values stored in __m256d with SSE/AVX

It appears that you’re doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all. For a dot-product of an … Read more
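For cases where a horizontal sum of a single `__m256d` is still needed, one common pattern is sketched below. It assumes GCC or Clang on x86-64; the name `hsum_pd` and the array parameter are hypothetical.

```c
#include <immintrin.h>

/* Hypothetical helper: returns v[0] + v[1] + v[2] + v[3]. */
__attribute__((target("avx")))
double hsum_pd(const double v[4])
{
    __m256d x = _mm256_loadu_pd(v);

    /* Reduce 256 bits to 128 bits: { v0+v2, v1+v3 } */
    __m128d low  = _mm256_castpd256_pd128(x);
    __m128d high = _mm256_extractf128_pd(x, 1);
    __m128d sum2 = _mm_add_pd(low, high);

    /* Bring the upper element down and add the two remaining values. */
    __m128d upper = _mm_unpackhi_pd(sum2, sum2);
    return _mm_cvtsd_f64(_mm_add_sd(sum2, upper));
}
```

Narrowing to 128 bits first keeps the remaining steps on cheaper in-lane operations.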

Fastest way to unpack 32 bits to a 32 byte SIMD vector

To “broadcast” the 32 bits of a 32-bit integer x to the 32 bytes of a 256-bit YMM register z, or to the 16 bytes each of two 128-bit XMM registers z_low and z_high, you can do the following. With AVX2: __m256i y = _mm256_set1_epi32(x); __m256i z = _mm256_shuffle_epi8(y,mask1); z = _mm256_and_si256(z,mask2); Without AVX2 it’s best to do … Read more
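The excerpt does not show how `mask1` and `mask2` are defined. The sketch below fills them in with one plausible choice, which is an assumption rather than the original answer's code, and adds a final compare (also an addition) so that each output byte is exactly 0xFF or 0x00. The function name `bits_to_bytes` is hypothetical; GCC or Clang on x86-64 is assumed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = 0xFF if bit i of x is set, else 0x00. */
__attribute__((target("avx2")))
void bits_to_bytes(uint32_t x, uint8_t out[32])
{
    /* Assumed mask1: output bytes 0-7 pick source byte 0, bytes 8-15 pick
     * byte 1, bytes 16-23 pick byte 2, bytes 24-31 pick byte 3. The shuffle
     * works per 128-bit lane, which is fine because x is in every dword. */
    const __m256i mask1 = _mm256_set_epi32(
        0x03030303, 0x03030303, 0x02020202, 0x02020202,
        0x01010101, 0x01010101, 0x00000000, 0x00000000);

    /* Assumed mask2: within each group of 8 bytes, byte j keeps only bit j. */
    const __m256i mask2 =
        _mm256_set1_epi64x((long long)0x8040201008040201ULL);

    __m256i y = _mm256_set1_epi32((int)x);
    __m256i z = _mm256_shuffle_epi8(y, mask1);
    z = _mm256_and_si256(z, mask2);

    /* Extra step, not in the excerpt: normalise non-zero bytes to 0xFF. */
    z = _mm256_cmpeq_epi8(z, mask2);

    _mm256_storeu_si256((__m256i *)out, z);
}
```

Without the final compare, each byte holds either zero or the single isolated bit, which is already enough for a zero/non-zero test.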

Transpose an 8×8 float using AVX/AVX2

I already answered this question Fast memory transpose with SSE, AVX, and OpenMP. Let me repeat the solution for transposing an 8×8 float matrix with AVX. Let me know if this is any faster than using 4×4 blocks and _MM_TRANSPOSE4_PS. I used it for a kernel in a larger matrix transpose which was memory bound … Read more
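Since the excerpt is truncated before the code, here is a sketch of a widely used three-stage AVX approach (unpack, shuffle, then 128-bit permute). It is not necessarily identical to the original answer. The name `transpose8x8_ps` and the flat row-major array interface are hypothetical; GCC or Clang on x86-64 is assumed.

```c
#include <immintrin.h>

/* Hypothetical helper: out = transpose of the row-major 8x8 matrix in. */
__attribute__((target("avx")))
void transpose8x8_ps(const float in[64], float out[64])
{
    __m256 r0 = _mm256_loadu_ps(in +  0), r1 = _mm256_loadu_ps(in +  8);
    __m256 r2 = _mm256_loadu_ps(in + 16), r3 = _mm256_loadu_ps(in + 24);
    __m256 r4 = _mm256_loadu_ps(in + 32), r5 = _mm256_loadu_ps(in + 40);
    __m256 r6 = _mm256_loadu_ps(in + 48), r7 = _mm256_loadu_ps(in + 56);

    /* Stage 1: interleave pairs of rows within each 128-bit lane. */
    __m256 t0 = _mm256_unpacklo_ps(r0, r1), t1 = _mm256_unpackhi_ps(r0, r1);
    __m256 t2 = _mm256_unpacklo_ps(r2, r3), t3 = _mm256_unpackhi_ps(r2, r3);
    __m256 t4 = _mm256_unpacklo_ps(r4, r5), t5 = _mm256_unpackhi_ps(r4, r5);
    __m256 t6 = _mm256_unpacklo_ps(r6, r7), t7 = _mm256_unpackhi_ps(r6, r7);

    /* Stage 2: combine pairs so each lane holds 4 elements of one column. */
    __m256 u0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0));
    __m256 u1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2));
    __m256 u2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0));
    __m256 u3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2));
    __m256 u4 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(1, 0, 1, 0));
    __m256 u5 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(3, 2, 3, 2));
    __m256 u6 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(1, 0, 1, 0));
    __m256 u7 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(3, 2, 3, 2));

    /* Stage 3: exchange 128-bit halves (0x20 = both low, 0x31 = both high). */
    _mm256_storeu_ps(out +  0, _mm256_permute2f128_ps(u0, u4, 0x20));
    _mm256_storeu_ps(out +  8, _mm256_permute2f128_ps(u1, u5, 0x20));
    _mm256_storeu_ps(out + 16, _mm256_permute2f128_ps(u2, u6, 0x20));
    _mm256_storeu_ps(out + 24, _mm256_permute2f128_ps(u3, u7, 0x20));
    _mm256_storeu_ps(out + 32, _mm256_permute2f128_ps(u0, u4, 0x31));
    _mm256_storeu_ps(out + 40, _mm256_permute2f128_ps(u1, u5, 0x31));
    _mm256_storeu_ps(out + 48, _mm256_permute2f128_ps(u2, u6, 0x31));
    _mm256_storeu_ps(out + 56, _mm256_permute2f128_ps(u3, u7, 0x31));
}
```

Whether this beats 4×4 blocks with `_MM_TRANSPOSE4_PS` depends on the target microarchitecture and on whether the surrounding kernel is memory bound, so it is worth measuring both.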

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate). An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two. The IEEE and C standards allow this when … Read more
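The single-rounding difference can be observed directly. The sketch below, assuming GCC or Clang on x86-64 hardware with FMA support, compares an explicit FMA intrinsic with a multiply and add that are kept separate. Both function names are hypothetical, and the `volatile` temporary is one simple way to stop the compiler from contracting the second version into an FMA.

```c
#include <immintrin.h>

/* Hypothetical helper: a*b + c with a single rounding (hardware FMA). */
__attribute__((target("fma")))
double fused_mul_add(double a, double b, double c)
{
    __m128d r = _mm_fmadd_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
    return _mm_cvtsd_f64(r);
}

/* Hypothetical helper: a*b rounded to double, then + c rounded again. */
double separate_mul_add(double a, double b, double c)
{
    /* volatile forces the rounded product to be stored and reloaded,
     * so the multiply and the add cannot be fused. */
    volatile double product = a * b;
    return product + c;
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. The separate version therefore returns 0, while the fused version returns -2^-54.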

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

Related: if you’re looking for the nonexistent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling. Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel’s reduce_add helper function. reduce_add doesn’t necessarily compile optimally anyway with AVX512. There is an int _mm512_reduce_add_epi32(__m512i) inline … Read more
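The body of `hsum_8x32` is not visible in the truncated excerpt. The following is a sketch of how such an AVX2 reduction is commonly written, not a copy of the original. The helper `hsum_epi32` and the array wrapper `sum8_u32` are hypothetical names; GCC or Clang on x86-64 is assumed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: sum of the four 32-bit elements of x (SSE2 only). */
static inline uint32_t hsum_epi32(__m128i x)
{
    __m128i hi64  = _mm_unpackhi_epi64(x, x);       /* upper two elements */
    __m128i sum64 = _mm_add_epi32(hi64, x);         /* {x0+x2, x1+x3, ..} */
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);     /* element 0 = total  */
    return (uint32_t)_mm_cvtsi128_si32(sum32);
}

/* Sum of the eight 32-bit elements of v: narrow to 128 bits first. */
__attribute__((target("avx2")))
static inline uint32_t hsum_8x32(__m256i v)
{
    __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(v),
                                   _mm256_extracti128_si256(v, 1));
    return hsum_epi32(sum128);
}

/* Hypothetical wrapper so callers can pass a plain array. */
__attribute__((target("avx2")))
uint32_t sum8_u32(const uint32_t v[8])
{
    return hsum_8x32(_mm256_loadu_si256((const __m256i *)v));
}
```

The sum wraps modulo 2^32, as the 32-bit packed adds do.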