Visual Studio 2017: _mm_load_ps often compiled to movups

On recent versions of Visual Studio and the Intel Compiler (recent meaning post-2013?), the compiler rarely generates aligned SIMD loads/stores anymore. When compiling for AVX or higher: the Microsoft compiler (>VS2013?) doesn’t generate aligned loads, but it still generates aligned stores. The Intel compiler (> Parallel Studio 2012?) generates neither anymore. …
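A practical consequence is that code should not count on `_mm_load_ps` faulting to catch a misaligned pointer. Here is a minimal sketch of the alternative, assuming you would rather check alignment explicitly and use the unaligned load where alignment is not guaranteed. The helper names `is_aligned16` and `sum_floats` are hypothetical and only for illustration.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: sums n floats (n assumed to be a multiple of 4).
// _mm_loadu_ps has no alignment requirement, so this is valid for any p.
inline float sum_floats(const float* p, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    }
    // lanes is declared 16-byte aligned, so the aligned store is valid here.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

If an alignment guarantee really matters for a given buffer, an explicit `assert(is_aligned16(ptr))` in debug builds states that intent without depending on which mov instruction the compiler happens to pick.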

Can PTEST be used to test if two registers are both zero or some other condition?

No, unless I’m missing something clever, ptest with two unknown registers is generally not useful for checking some property about both of them (other than obvious stuff you’d already want a bitwise-AND for, like intersection between two bitmaps). To test two registers for both being all-zero, OR them together and PTEST the result against itself. ptest …
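With intrinsics, the OR-then-PTEST idea can be sketched roughly as follows. This assumes SSE4.1 is available, since `_mm_testz_si128` is the PTEST intrinsic that returns the ZF result. The function names and the `TARGET_SSE41` macro are hypothetical; the macro is only there so GCC/Clang accept the SSE4.1 intrinsic without a global `-msse4.1` flag.

```cpp
#include <immintrin.h>

// Assumption: on GCC/Clang, enable SSE4.1 per function; MSVC needs no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// True only if every bit of both a and b is zero.
TARGET_SSE41
inline bool both_all_zero(__m128i a, __m128i b) {
    __m128i either = _mm_or_si128(a, b);          // any set bit in a or b survives
    return _mm_testz_si128(either, either) != 0;  // ZF = ((either & either) == 0)
}

// The bitmap-intersection case: true if a and b share no set bit.
TARGET_SSE41
inline bool bitmaps_disjoint(__m128i a, __m128i b) {
    return _mm_testz_si128(a, b) != 0;            // ZF = ((a & b) == 0)
}
```

The second function is the case where PTEST on two different registers is directly useful, because the AND it performs internally is exactly the operation wanted.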

Where is VPERMB in AVX2?

I’m 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn’t exist is that the implementation cost must outweigh the significant benefit. Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field …

SSE multiplication of 4 32-bit integers

If you need signed 32×32-bit integer multiplication then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
        __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE …
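Since the excerpt above is cut off, here is a self-contained sketch of the same SSE2 technique, written from scratch rather than as a completion of the quoted code. The name `mul32_lo_sse2` is hypothetical. It relies on the fact that the low 32 bits of a 32×32 product are the same for signed and unsigned inputs, which is why the unsigned `_mm_mul_epu32` is enough.

```cpp
#include <emmintrin.h>  // SSE2

// Lane-wise 32-bit multiply, keeping the low 32 bits of each product.
inline __m128i mul32_lo_sse2(__m128i a, __m128i b) {
    // 64-bit products of lanes 0 and 2.
    __m128i even = _mm_mul_epu32(a, b);
    // Shift lanes 1 and 3 down to positions 0 and 2, then multiply those.
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Move the low dword of each 64-bit product into lanes 0 and 1.
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    // Interleave to restore lane order: [p0, p1, p2, p3].
    return _mm_unpacklo_epi32(even_lo, odd_lo);
}
```

On targets with SSE4.1, the single intrinsic `_mm_mullo_epi32` computes the same result, so a fallback like this is mainly of interest when only SSE2 can be assumed.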