avx2 - w3toppers.com

Transpose an 8×8 float using AVX/AVX2

I already answered this question Fast memory transpose with SSE, AVX, and OpenMP. Let me repeat the solution for transposing an 8×8 float matrix with AVX. Let me know if this is any faster than using 4×4 blocks and _MM_TRANSPOSE4_PS. I used it for a kernel in a larger matrix transpose which was memory bound … Read more

Emulating shifts on 32 bytes with AVX

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8. _mm256_slli_si256(A, N) 0 < N < 16 _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 – N) N = 16 _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)) 16 < N < 32 _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, … Read more

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

Related: if you’re looking for the non-existant _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics vpsadbw as an hsum within qwords is much more efficient than shuffling. Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel’s reduce_add helper function. reduce_add doesn’t necessarily compile optimally anyway with AVX512. There is a int _mm512_reduce_add_epi32(__m512i) inline … Read more

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

AVX-2 @HadiBreis’ comment links to an article on fast population-count with SSSE3, by Wojciech Muła; the article links to this GitHub repository; and the repository has the following AVX-2 implementation. It’s based on a vectorized lookup instruction, and using a 16-value lookup table for the bit counts of nibbles. # include <immintrin.h> # include <x86intrin.h> … Read more

Fastest way to multiply an array of int64_t?

You seem to be assuming long is 64bits in your code, but then using __uint64_t as well. In 32bit, the x32 ABI, and on Windows, long is a 32bit type. Your title mentions long long, but then your code ignores it. I was wondering for a while if your code was assuming that long was … Read more

Fastest Implementation of Exponential Function Using AVX

How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD [duplicate]

Loading 8 chars from memory into an __m256 variable as packed single precision floats

If you’re using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place. ; rsi = new_image VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord) VCVTDQ2PS ymm0, ymm0 ; convert to packed foat This is a good strategy … Read more

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

Efficient implementation of log2(__m256d) in AVX2

The usual strategy is based on the identity log(a*b) = log(a) + log(b), or in this case log2( 2^exponent * mantissa) ) = log2( 2^exponent ) + log2(mantissa). Or simplifying, exponent + log2(mantissa). The mantissa has a very limited range, 1.0 to 2.0, so a polynomial for log2(mantissa) only has to fit over that very … Read more