sse2 - w3toppers.com

SSE multiplication of 4 32-bit integers

If you need signed 32×32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
        __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE …
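As a rough illustration of the even/odd-element technique the excerpt describes, the following sketch multiplies four 32-bit integers using only SSE2 and keeps the low 32 bits of each product. The excerpt is truncated, so this is not a reconstruction of the quoted function; the names mul32_lo_sse2, mul4_i32 and demo_mul4_i32 are hypothetical. It relies on the fact that the low 32 bits of a product are the same for signed and unsigned operands.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of each of the four 32x32-bit products. */
static inline __m128i mul32_lo_sse2(__m128i a, __m128i b)
{
    /* _mm_mul_epu32 multiplies elements 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift elements 1 and 3 down to positions 0 and 2, then multiply them. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* Gather the low dword of each 64-bit product into dwords 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave to restore element order: p0, p1, p2, p3. */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical array wrapper so the result is easy to inspect. */
static void mul4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul32_lo_sse2(va, vb));
}

/* Usage example: signed inputs give the expected signed products. */
static void demo_mul4_i32(void)
{
    const int32_t a[4] = {1, -2, 3, 100000};
    const int32_t b[4] = {5, 7, -11, 3};
    int32_t out[4];
    mul4_i32(a, b, out);
    assert(out[0] == 5 && out[1] == -14 && out[2] == -33 && out[3] == 300000);
}
```

Products that do not fit in 32 bits are silently truncated here, which matches ordinary C integer wrap-around for unsigned values but may not be what every caller wants.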

Emulating shifts on 32 bytes with AVX

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

_mm256_slli_si256(A, N)

0 < N < 16
    _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)

N = 16
    _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))

16 < N < 32
    _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, …

Performance optimisations of x86-64 assembly – Alignment and branch prediction

Alignment optimisations

1. Use .p2align <abs-expr>, <abs-expr>, <abs-expr> instead of align. It grants fine-grained control through its three parameters:
   param1 – Align to what boundary.
   param2 – Fill padding with what (zeroes or NOPs).
   param3 – Do NOT align if padding would exceed the specified number of bytes.
2. Align the start of a frequently used code …

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

Fast counting the number of set bits in __m128i register

Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below computes the number of bits set in each byte.

SSE2-only version (based on Algorithm 3 in the book Hacker’s Delight):

    static const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
    static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
    static inline __m128i …
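For readers who want something concrete, here is a minimal sketch of one way a per-byte bit count can be built from SSE2 shifts, masks and subtractions, using the same 0x77 and 0x0F masks that appear in the excerpt. The excerpt is truncated, so this is not necessarily the author's code; popcnt8_sketch, popcnt128_sketch and demo_popcnt are hypothetical names, and the final byte summation with _mm_sad_epu8 is an assumption about how one might total the counts.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical sketch: number of set bits in each byte, SSE2 only. */
static inline __m128i popcnt8_sketch(__m128i x)
{
    const __m128i mask77 = _mm_set1_epi8(0x77);
    const __m128i mask0f = _mm_set1_epi8(0x0F);
    __m128i n;
    /* For each 4-bit nibble v, compute v - (v>>1) - (v>>2) - (v>>3),
       which equals the number of set bits in v. The 0x77 mask drops
       bits shifted in from the neighbouring nibble. */
    n = _mm_and_si128(_mm_srli_epi64(x, 1), mask77);
    x = _mm_sub_epi8(x, n);
    n = _mm_and_si128(_mm_srli_epi64(n, 1), mask77);
    x = _mm_sub_epi8(x, n);
    n = _mm_and_si128(_mm_srli_epi64(n, 1), mask77);
    x = _mm_sub_epi8(x, n);
    /* Add the high-nibble count to the low-nibble count of each byte. */
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    return _mm_and_si128(x, mask0f);
}

/* Hypothetical helper: total number of set bits in the 128-bit value. */
static inline int popcnt128_sketch(__m128i x)
{
    /* _mm_sad_epu8 against zero sums each group of 8 bytes into a 64-bit lane. */
    __m128i sums = _mm_sad_epu8(popcnt8_sketch(x), _mm_setzero_si128());
    return _mm_cvtsi128_si32(sums) + _mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
}

/* Usage example: 1 bit + 16 bits + 32 bits + 0 bits across the four dwords. */
static void demo_popcnt(void)
{
    __m128i v = _mm_set_epi32(0, -1, 0x0F0F0F0F, 1);
    assert(popcnt128_sketch(v) == 49);
}
```

On CPUs with the POPCNT instruction, two scalar 64-bit population counts are likely to be simpler; the sketch is only meant for the SSE2-only case.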

What is the point of SSE2 instructions such as orpd?

Extended (80-bit) double floating point in x87, not SSE2 – we don’t miss it?

The biggest problem with x87 is basically that all register operations are done in 80 bits, whereas most of the time people only use 64-bit floats (i.e. double-precision floats). What happens is, you load a 64-bit float into the x87 stack, and it gets converted to 80 bits. You do some operations on …

What’s the difference between logical SSE intrinsics?

Is there any difference between using one or another intrinsic (with appropriate type casting)? Won’t there be any hidden costs like longer execution in some specific situation?

Yes, there can be performance reasons to choose one vs. the other.

1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output …
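To make the "appropriate type casting" part concrete, the sketch below computes the same bitwise AND through the integer, single-precision and double-precision intrinsics and checks that the results are bit-for-bit identical, so any difference between them is one of performance and not of value. The helper names same_bits, and_variants_agree and demo_and_variants are hypothetical, and a compiler is free to emit a different instruction than the intrinsic's name suggests.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: _mm_and_ps */
#include <emmintrin.h> /* SSE2: _mm_and_pd, _mm_and_si128, casts */

/* Hypothetical helper: 1 if two vectors are bit-for-bit identical. */
static int same_bits(__m128i x, __m128i y)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}

/* Hypothetical helper: 1 if all three AND flavours give the same bits. */
static int and_variants_agree(__m128i a, __m128i b)
{
    /* Integer domain, usually compiled to pand. */
    __m128i r_int = _mm_and_si128(a, b);
    /* Single-precision domain, usually andps; the casts generate no code. */
    __m128i r_ps = _mm_castps_si128(
        _mm_and_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
    /* Double-precision domain, usually andpd. */
    __m128i r_pd = _mm_castpd_si128(
        _mm_and_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
    return same_bits(r_int, r_ps) && same_bits(r_int, r_pd);
}

/* Usage example: even NaN bit patterns pass through unchanged. */
static void demo_and_variants(void)
{
    __m128i a = _mm_set_epi32(0x7FC00000, -1, 0x12345678, 0);
    __m128i b = _mm_set1_epi32(0x0FF00FF0);
    assert(and_variants_agree(a, b));
}
```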

Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

SIMD is meant to work on multiple small values at the same time, so no carry propagates into the higher unit; you must handle that manually. SSE2 has no carry flag, but you can easily calculate the carry as carry = sum < a or carry = sum < b …
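The manual carry can be sketched as follows, assuming a 128-bit integer stored as two 64-bit limbs. The scalar version uses the carry = sum < a rule from the text. The vector version is an assumption of mine: SSE2 has no unsigned 64-bit compare, so it derives the carry out of bit 63 as (a & b) | ((a | b) & ~sum) instead. All names (u128_limbs, add128_scalar, add128_sse2, add128_sse2_limbs, demo_add128) are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical type: a 128-bit unsigned integer as two 64-bit limbs. */
typedef struct { uint64_t lo, hi; } u128_limbs;

/* Scalar reference: unsigned wrap-around means a carry occurred if sum < a. */
static u128_limbs add128_scalar(u128_limbs a, u128_limbs b)
{
    u128_limbs r;
    r.lo = a.lo + b.lo;
    uint64_t carry = r.lo < a.lo;
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* SSE2 sketch: low limb in bits 0..63, high limb in bits 64..127. */
static inline __m128i add128_sse2(__m128i a, __m128i b)
{
    /* Add both limbs independently; no carry crosses between them. */
    __m128i sum = _mm_add_epi64(a, b);
    /* Carry out of bit 63 of each limb: (a & b) | ((a | b) & ~sum). */
    __m128i gen = _mm_and_si128(a, b);
    __m128i prop = _mm_andnot_si128(sum, _mm_or_si128(a, b));
    __m128i carry = _mm_srli_epi64(_mm_or_si128(gen, prop), 63);
    /* Move the low limb's carry up to the high limb; the high limb's
       own carry is shifted out and discarded. */
    carry = _mm_slli_si128(carry, 8);
    return _mm_add_epi64(sum, carry);
}

/* Hypothetical wrapper so both versions share one interface. */
static u128_limbs add128_sse2_limbs(u128_limbs a, u128_limbs b)
{
    uint64_t out[2];
    __m128i va = _mm_set_epi64x((long long)a.hi, (long long)a.lo);
    __m128i vb = _mm_set_epi64x((long long)b.hi, (long long)b.lo);
    _mm_storeu_si128((__m128i *)out, add128_sse2(va, vb));
    u128_limbs r = {out[0], out[1]};
    return r;
}

/* Usage example: the low limb overflows and carries into the high limb. */
static void demo_add128(void)
{
    u128_limbs a = {UINT64_C(0xFFFFFFFFFFFFFFFF), 1};
    u128_limbs b = {1, 2};
    u128_limbs s = add128_scalar(a, b);
    u128_limbs v = add128_sse2_limbs(a, b);
    assert(s.lo == 0 && s.hi == 4);
    assert(v.lo == s.lo && v.hi == s.hi);
}
```

For a single 128-bit addition the scalar form is likely to be faster in practice, since a compiler can turn it into an add followed by an add-with-carry; the vector form mainly shows what "doing the carry manually" involves.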