SSE multiplication of 4 32-bit integers

If you need signed 32×32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want:

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0 */
    __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE …
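The even/odd-lane idea in that excerpt can be sketched as follows. This is a minimal illustration and not the Intel example itself: the names mul32x4_sketch and mul32x4_matches_scalar are hypothetical. It relies on the fact that the low 32 bits of a product are the same for signed and unsigned operands, which is why the unsigned _mm_mul_epu32 is sufficient here.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical helper: low 32 bits of a[i] * b[i] for four 32-bit lanes.
static inline __m128i mul32x4_sketch(__m128i a, __m128i b)
{
    // 64-bit products of lanes 0 and 2.
    __m128i even = _mm_mul_epu32(a, b);
    // Shift lanes 1 and 3 down by 4 bytes, then multiply them the same way.
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Move the low 32 bits of each 64-bit product into elements 0 and 1.
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    // Interleave to restore lane order: p0, p1, p2, p3.
    return _mm_unpacklo_epi32(even_lo, odd_lo);
}

// Compares the vector result with plain scalar arithmetic (wrapping, via uint32_t).
static bool mul32x4_matches_scalar()
{
    const int32_t x[4] = {3, -7, 0x7fffffff, -123456789};
    const int32_t y[4] = {5, 9, 2, 1000};
    int32_t out[4];
    __m128i r = mul32x4_sketch(_mm_loadu_si128(reinterpret_cast<const __m128i*>(x)),
                               _mm_loadu_si128(reinterpret_cast<const __m128i*>(y)));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
    for (int i = 0; i < 4; ++i) {
        uint32_t expected = static_cast<uint32_t>(x[i]) * static_cast<uint32_t>(y[i]);
        if (static_cast<uint32_t>(out[i]) != expected) return false;
    }
    return true;
}
```

Only SSE2 is assumed, so the sketch should build on any x86-64 compiler that provides emmintrin.h.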

Emulating shifts on 32 bytes with AVX

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

To emulate _mm256_slli_si256(A, N) across all 32 bytes:

0 < N < 16
    _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)

N = 16
    _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))

16 < N < 32
    _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, …
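As a rough sketch of the 0 < N < 16 case, the snippet below shifts 32 bytes left by a fixed, hypothetical amount kShift = 5 and checks the result against a scalar loop. The function names are illustrative. Assumptions: a GCC or Clang compiler (for the target attribute and __builtin_cpu_supports), and the check is skipped on CPUs without AVX2.

```cpp
#include <immintrin.h>  // AVX2
#include <cassert>
#include <cstdint>

constexpr int kShift = 5;  // hypothetical shift amount, must satisfy 0 < kShift < 16

// Shifts the 32 bytes at `in` towards higher byte indices by kShift bytes.
__attribute__((target("avx2")))
static void shift_left_bytes_sketch(const uint8_t* in, uint8_t* out)
{
    __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));
    // Control 0x08: high lane = low lane of a, low lane = zero.
    __m256i carry = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(0, 0, 2, 0));
    // Per lane: concatenate a (upper) with carry (lower), shift right by 16 - kShift bytes.
    __m256i r = _mm256_alignr_epi8(a, carry, 16 - kShift);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), r);
}

// Compares against a byte-wise reference: out[i] = in[i - kShift], zero-filled.
static bool shift_matches_reference()
{
    if (!__builtin_cpu_supports("avx2")) return true;  // nothing to check on this CPU
    uint8_t in[32], out[32];
    for (int i = 0; i < 32; ++i) in[i] = static_cast<uint8_t>(i + 1);
    shift_left_bytes_sketch(in, out);
    for (int i = 0; i < 32; ++i) {
        uint8_t expected = (i < kShift) ? 0 : in[i - kShift];
        if (out[i] != expected) return false;
    }
    return true;
}
```

The shift count has to be a compile-time constant because _mm256_alignr_epi8 takes an immediate operand.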

Performance optimisations of x86-64 assembly – Alignment and branch prediction

Alignment optimisations

1. Use .p2align <abs-expr>, <abs-expr>, <abs-expr> instead of align. Its three parameters give fine-grained control:
   param1: Align to what boundary (given as a power of two, so 4 means 16 bytes).
   param2: Fill padding with what (zeroes or NOPs).
   param3: Do NOT align if padding would exceed the specified number of bytes.
2. Align the start of a frequently used code …
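To show the three-parameter form in context, here is a small sketch that embeds a GNU assembler block in a C++ file. It assumes Linux x86-64, ELF output, AT&T syntax and the System V calling convention. The function add_one_sketch is hypothetical and exists only to give the directive something to align.

```cpp
#include <cassert>

// Top-level assembler block (GNU as syntax).
// .p2align 4,,10 means: align to 2^4 = 16 bytes, use the default fill
// (param2 left empty), and skip the alignment entirely if it would need
// more than 10 bytes of padding.
asm(
    "    .pushsection .text\n"
    "    .p2align 4,,10\n"
    "    .globl add_one_sketch\n"
    "    .type add_one_sketch, @function\n"
    "add_one_sketch:\n"
    "    leal 1(%rdi), %eax\n"   // return x + 1 (first argument arrives in edi)
    "    ret\n"
    "    .popsection\n"
);

extern "C" int add_one_sketch(int x);
```

Because of the third parameter, the label is not guaranteed to land on a 16-byte boundary; the assembler aligns it only when the padding cost stays within the limit.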

Fast counting the number of set bits in __m128i register

Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below computes the number of bits set in each byte.

SSE2-only version (based on Algorithm 3 in the book Hacker’s Delight):

static const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
static inline __m128i …
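Since the excerpt is cut off, the following is only a sketch of how a per-byte population count can be built from the two masks shown, using the nibble identity popcount(x) = x - x/2 - x/4 - x/8. It is not the author's original function: popcnt8_sketch, popcnt128_sketch and the check helper are hypothetical names, and the final summation with _mm_sad_epu8 is one possible way to get a whole-register total.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of set bits in each of the 16 bytes.
static inline __m128i popcnt8_sketch(__m128i x)
{
    const __m128i mask1 = _mm_set1_epi8(0x77);
    const __m128i mask2 = _mm_set1_epi8(0x0F);
    // For each nibble: x - x/2 - x/4 - x/8. The 0x77 mask discards bits
    // that the 64-bit shift drags in from the neighbouring nibble.
    __m128i n = _mm_and_si128(_mm_srli_epi64(x, 1), mask1);
    x = _mm_sub_epi8(x, n);
    n = _mm_and_si128(_mm_srli_epi64(n, 1), mask1);
    x = _mm_sub_epi8(x, n);
    n = _mm_and_si128(_mm_srli_epi64(n, 1), mask1);
    x = _mm_sub_epi8(x, n);
    // Add the two nibble counts of each byte and keep the low nibble.
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    return _mm_and_si128(x, mask2);
}

// Hypothetical helper: total number of set bits in the register.
static inline int popcnt128_sketch(__m128i x)
{
    // _mm_sad_epu8 against zero sums the bytes of each 64-bit half.
    __m128i sums = _mm_sad_epu8(popcnt8_sketch(x), _mm_setzero_si128());
    return _mm_cvtsi128_si32(sums) + _mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
}

// Compares both helpers with a bit-by-bit scalar count.
static bool popcnt_matches_scalar()
{
    uint8_t in[16], counts[16];
    int total = 0;
    for (int i = 0; i < 16; ++i) in[i] = static_cast<uint8_t>(i * 37 + 11);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(counts), popcnt8_sketch(v));
    for (int i = 0; i < 16; ++i) {
        int expected = 0;
        for (int b = 0; b < 8; ++b) expected += (in[i] >> b) & 1;
        if (counts[i] != expected) return false;
        total += expected;
    }
    return popcnt128_sketch(v) == total;
}
```

SSE2 has no 8-bit shift, which is why the sketch shifts wider elements and masks the result.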