sse - w3toppers.com

Most insanely fast way to convert 9 char digits into an int or unsigned int

Yes, SIMD is possible, as mentioned in comments. You can take advantage of it to parse the HH, MM, and SS parts of the string at the same time. Since you have a 100% fixed format with leading 0s where necessary, this is easier than "How to implement atoi using SIMD?": place values are fixed …
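As a rough illustration of how fixed place values help, here is a minimal SSE2 sketch. It assumes the input is exactly "HH:MM:SS" with leading zeros and does no validation. The function name parse_hhmmss_seconds and the choice to return seconds since midnight are assumptions made for this example, not part of the original answer.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Hypothetical helper: parse a fixed "HH:MM:SS" string (exactly 8 bytes,
   no validation) and return seconds since midnight. */
int parse_hhmmss_seconds(const char *s)
{
    assert(s != NULL);
    /* Load the 8 characters into the low half of a vector. */
    __m128i chars = _mm_loadl_epi64((const __m128i *)s);
    /* ASCII -> digit value; each ':' becomes 10 and gets weight 0 below. */
    __m128i digits = _mm_sub_epi8(chars, _mm_set1_epi8('0'));
    /* Zero-extend the 8 bytes to 16-bit lanes. */
    __m128i wide = _mm_unpacklo_epi8(digits, _mm_setzero_si128());
    /* Fixed place values, one per character:   H   H  :   M  M  :   S  S */
    const __m128i weights = _mm_setr_epi16(10, 1, 0, 10, 1, 0, 10, 1);
    /* Multiply and add adjacent pairs -> 32-bit lanes {HH, 10*M1, M0, SS}. */
    __m128i sums = _mm_madd_epi16(wide, weights);
    int r[4];
    _mm_storeu_si128((__m128i *)r, sums);
    return r[0] * 3600 + (r[1] + r[2]) * 60 + r[3];
}
```

Because the separators sit at known positions, they can simply be given a weight of zero instead of being shuffled out first.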

Efficient sse shuffle mask generation for left-packing byte elements

Assuming: change1 = _mm_movemask_epi8(bytemask); offset = popcnt(change1); On large buffers, using two shuffles and a 1 KiB table is only ~10% slower than using 1 shuffle and a 1 MiB table. My attempts at generating the shuffle mask via prefix sums and bit twiddling are about half the speed of the table-based methods (solutions …
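To make the roles of the movemask result and the popcount offset concrete, here is a small sketch of left-packing one 16-byte block. It assumes the bytes to drop are those equal to a given value. The table lookup and byte shuffle that the answer benchmarks are replaced by a plain scalar loop, so this shows only the data flow, not the fast path. The name left_pack_16 is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy the bytes of src[0..15] that differ from `drop`
   to dst, packed to the left, and return how many bytes were kept. */
int left_pack_16(const uint8_t *src, uint8_t drop, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* 0xFF in every byte that should be removed. */
    __m128i dropmask = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)drop));
    /* One bit per byte; invert so that set bits mark the bytes to keep. */
    unsigned keep = ~(unsigned)_mm_movemask_epi8(dropmask) & 0xFFFFu;
    int count = 0;
    /* Scalar stand-in for the table lookup + shuffle step. */
    for (int i = 0; i < 16; i++) {
        if ((keep >> i) & 1u)
            dst[count++] = src[i];
    }
    /* count equals popcount(keep): how far the output pointer advances. */
    return count;
}
```

In a table-driven version, keep (or a part of it) would index a precomputed array of shuffle control vectors, and the returned count would be the amount by which the output pointer advances.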

Find the first instance of a character using simd

You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that’s the optimal way to implement this for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and …
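The compare-then-movemask pattern can be sketched as follows. To stay within baseline x86-64, this example uses the 16-byte SSE2 forms (_mm_cmpeq_epi8 and _mm_movemask_epi8) rather than the 256-bit intrinsics named above; the structure is the same, with a 16-bit mask instead of a 32-bit one. The function name find_first_byte and its return convention are assumptions for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Hypothetical helper: index of the first byte equal to c in buf[0..len),
   or len if there is none. */
size_t find_first_byte(const unsigned char *buf, size_t len, unsigned char c)
{
    assert(buf != NULL || len == 0);
    const __m128i needle = _mm_set1_epi8((char)c);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* One mask bit per byte: set where chunk[j] == c. */
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            /* Lowest set bit = first match in this chunk. A bit-scan
               instruction would normally replace this portable loop. */
            size_t bit = 0;
            while (!((mask >> bit) & 1u))
                bit++;
            return i + bit;
        }
    }
    for (; i < len; i++) {  /* scalar tail for the last len % 16 bytes */
        if (buf[i] == c)
            return i;
    }
    return len;
}
```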

Visual Studio 2017: _mm_load_ps often compiled to movups

On recent versions of Visual Studio and the Intel Compiler (recent as post-2013?), the compilers rarely generate aligned SIMD loads/stores anymore. When compiling for AVX or higher: The Microsoft compiler (>VS2013?) doesn’t generate aligned loads, but it still generates aligned stores. The Intel compiler (> Parallel Studio 2012?) doesn’t do it at all anymore. …

Can PTEST be used to test if two registers are both zero or some other condition?

No, unless I’m missing something clever, ptest with two unknown registers is generally not useful for checking some property about both of them. (Other than obvious stuff you’d already want a bitwise-AND for, like intersection between two bitmaps). To test two registers for both being all-zero, OR them together and PTEST that against itself. ptest …

Where is VPERMB in AVX2?

I’m 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn’t exist is that the implementation cost must outweigh the significant benefit. Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field …

Load address calculation when using AVX2 gather instructions

Gather instructions do not have any alignment requirements, so it would be too restrictive not to allow byte addressing. Another reason is consistency. With SIB addressing we obviously have a byte address: MOV eax, [rcx + rdx * 2]. Since VPGATHERDD is just a vectorized variant of this MOV instruction, we should not expect anything different …
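The address calculation can be modeled per lane in scalar C. The sketch below is not the instruction itself, only a model of how one 32-bit lane's load address is formed: base plus index times scale, all in bytes. The function name gather_lane_i32 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one 32-bit gather lane: the load address is
   base + index * scale, counted in bytes, with no alignment requirement.
   scale must be 1, 2, 4 or 8, as in SIB addressing. */
int32_t gather_lane_i32(const void *base, int32_t index, int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *addr =
        (const unsigned char *)base + (ptrdiff_t)index * scale;
    int32_t value;
    memcpy(&value, addr, sizeof value);  /* unaligned-safe 4-byte load */
    return value;
}
```

With an int32_t table, an index of 2 with scale 4 and an index of 8 with scale 1 both address byte offset 8, so they read the same element.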

SSE multiplication of 4 32-bit integers

If you need signed 32×32 bit integer multiplication then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a, b); /* mul 2,0 */
        __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0,0,2,0)),
                                  _mm_shuffle_epi32(tmp2, _MM_SHUFFLE …
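Since the excerpt is cut off, here is a separate, self-contained sketch of one common SSE2 formulation of the same idea, which may differ in detail from the truncated code. It relies on the low 32 bits of a product being the same for signed and unsigned operands. The names mul_epi32_sse2 and mul4_i32 are assumptions for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Lane-wise 32-bit multiply (low 32 bits of each product) using SSE2 only. */
__m128i mul_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift lanes 1 and 3 down into positions 0 and 2, then multiply. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* Keep the low 32 bits of each product, then interleave: 0,1,2,3. */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0)));
}

/* Hypothetical array wrapper so the result is easy to inspect. */
void mul4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul_epi32_sse2(va, vb));
}
```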

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do:

    __m256d x1, x2;
    // calculate 4 two-element horizontal sums:
    // lower 64 bits contain x1[0] + x1[1]
    // next 64 bits contain x2[0] + x2[1]
    // next 64 bits contain x1[2] + …

Get member of __m128 by index?

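A union is probably the most portable way to do this:

    #include <assert.h>
    #include <xmmintrin.h>

    // typedef makes U a type name; without it, U would be a variable
    typedef union {
        __m128 v;    // SSE 4 x float vector
        float a[4];  // scalar array of 4 floats
    } U;

    float vectorGetByIndex(__m128 V, unsigned int i)
    {
        U u;
        assert(i <= 3);  // only indices 0..3 are valid
        u.v = V;
        return u.a[i];
    }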