Loading 8 chars from memory into an __m256 variable as packed single precision floats

If you’re using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.

; rsi = new_image
VPMOVZXBD   ymm0,  [rsi]   ; or SX to sign-extend  (Byte to DWord)
VCVTDQ2PS   ymm0, ymm0     ; convert to packed float

This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm for the low 64 bits and vpshufb ymm (_mm256_shuffle_epi8) for the high 64 bits, because Intel SnB-family CPUs don’t micro-fuse a vpmovzx ymm,mem, only vpmovzx xmm,mem (see Agner Fog's tables, https://agner.org/optimize/). Broadcast loads are a single uop that needs no ALU port, running purely on a load port. So expanding 16 bytes takes 3 total uops (broadcast-load + vpmovzx + vpshufb), instead of the 4 that two un-fused vpmovzxbd ymm,mem would cost.

(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64 -> _mm256_cvtepu8_epi32.)

Of course this requires a shuffle control vector in another register, so it’s only worth it if you can use that multiple times.

vpshufb is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
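Here is a rough intrinsics sketch of that broadcast-load + vpmovzxbd + vpshufb idea for 16 source bytes. The function name bytes16_to_floats and the float-array output are made up for illustration (in real code you would keep the two __m256 results in registers), and the target attribute assumes GCC or clang; with other compilers, drop it and enable AVX2 on the command line. Whether the compiler really emits a single vbroadcasti128 load is not guaranteed, so check the asm.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: expands 16 bytes at src to 16 floats at dst. */
__attribute__((target("avx2")))   /* assumption: GCC/clang; or build with -mavx2 */
void bytes16_to_floats(const uint8_t *src, float *dst)
{
    /* Shuffle control, per 128-bit lane: the selected source byte goes to
       the low byte of each dword; -1 has its high bit set, so vpshufb
       writes zero to the other three bytes (that is the zero-extension). */
    const __m256i hi_ctrl = _mm256_setr_epi8(
         8, -1, -1, -1,   9, -1, -1, -1,  10, -1, -1, -1,  11, -1, -1, -1,
        12, -1, -1, -1,  13, -1, -1, -1,  14, -1, -1, -1,  15, -1, -1, -1);

    /* One 16-byte load, copied into both 128-bit lanes. */
    __m256i bcast = _mm256_broadcastsi128_si256(
        _mm_loadu_si128((const __m128i *)src));

    /* Bytes 0..7: vpmovzxbd ymm, xmm reading the low lane. */
    __m256i lo = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(bcast));

    /* Bytes 8..15: in-lane byte shuffle; works because each lane holds
       all 16 source bytes after the broadcast. */
    __m256i hi = _mm256_shuffle_epi8(bcast, hi_ctrl);

    _mm256_storeu_ps(dst,     _mm256_cvtepi32_ps(lo));
    _mm256_storeu_ps(dst + 8, _mm256_cvtepi32_ps(hi));
}
```

In a loop, hi_ctrl should get hoisted into a register once, which is the "use it multiple times" condition mentioned above.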

This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn’t list uop counts for vpmovsx/zx ymm on it.

Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.

My answer on "Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?" may be relevant for converting back to uint8_t. The pack-back-to-bytes part afterward is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.
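For a single vector, one way to dodge the in-lane problem is to extract the high lane and use the 128-bit packs. A minimal sketch, with the hypothetical name floats8_to_bytes, a float array as input purely so it can be tested, and a GCC/clang target attribute as an assumption. When converting several vectors at once, 256-bit packs plus a lane-fixing shuffle would probably cost fewer uops per element, so treat this as an illustration of the lane issue rather than a tuned routine.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: converts 8 floats at src to 8 saturated bytes at dst. */
__attribute__((target("avx2")))   /* assumption: GCC/clang; or build with -mavx2 */
void floats8_to_bytes(const float *src, uint8_t *dst)
{
    /* float -> int32, using the current rounding mode (nearest-even by default). */
    __m256i i32 = _mm256_cvtps_epi32(_mm256_loadu_ps(src));

    /* Split into lanes so the 128-bit packs keep elements in order. */
    __m128i lo = _mm256_castsi256_si128(i32);
    __m128i hi = _mm256_extracti128_si256(i32, 1);

    /* int32 -> int16 with signed saturation (packssdw). */
    __m128i i16 = _mm_packs_epi32(lo, hi);

    /* int16 -> uint8 with unsigned saturation (packuswb): clamps to 0..255. */
    __m128i u8 = _mm_packus_epi16(i16, i16);

    /* Store only the low 8 bytes. */
    _mm_storel_epi64((__m128i *)dst, u8);
}
```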

With only AVX1, not AVX2, you should do:

VPMOVZXBD   xmm0,  [rsi]
VPMOVZXBD   xmm1,  [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1   ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS   ymm0, ymm0     ; convert to packed float.  Yes, works without AVX2

You of course never need an array of float, just __m256 vectors.

GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem] with intrinsics

GCC and MSVC are bad at folding a _mm_loadl_epi64 into a memory operand for vpmovzx*. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem].)

We get a vmovq load and then a separate vpmovzx with an XMM input. (ICC, clang3.6+, and gcc9+ do produce safe and optimal code from _mm_loadl_epi64.)

But gcc8.3 and earlier can fold a _mm_loadu_si128 16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3 on GCC, but is unsafe at -O0, where it compiles to an actual vmovdqu load that touches more data than we actually use and could go off the end of a page.

Two gcc bugs submitted because of this answer:

SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx (fixed for gcc9, but the fix breaks load folding for a 128-bit load so the workaround hack for old GCC makes gcc9 do worse.)
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)

There’s no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i memory source operand. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).

So here’s the evil solution I’ve come up with. Don’t use this, #ifdef __OPTIMIZE__ is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!

#include <immintrin.h>
#include <stdint.h>

#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif

__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef  USE_MOVQ  // compiles to an actual vmovq then vpmovzxbd ymm, xmm with gcc8.3 -O3
    __m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else  // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
    __m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif

    __m256i intvec = _mm256_cvtepu8_epi32( small_load );
    //__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p );  // compiles to an aligned load with -O0
    return _mm256_cvtepi32_ps(intvec);
}

With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits the following (so does MSVC):

load_bytes_to_m256(unsigned char*):
        vmovq   xmm0, QWORD PTR [rdi]
        vpmovzxbd       ymm0, xmm0
        vcvtdq2ps       ymm0, ymm0
        ret

The stupid vmovq is what we want to avoid. If you let it use the unsafe loadu_si128 version, it will make good optimized code.

GCC9, clang, and ICC emit:

load_bytes_to_m256(unsigned char*): 
        vpmovzxbd       ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
        vcvtdq2ps       ymm0, ymm0
        ret

Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for “instructions”, not “intrinsics”, and this is one place where there’s a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don’t want is pretty dumb.
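For what it's worth, one possible attempt at that AVX1-only exercise could look like the sketch below. The name bytes8_to_floats_avx1 and the float-array output are assumptions for illustration, as is the GCC/clang target attribute. It uses memcpy + _mm_cvtsi32_si128 to express the two 4-byte loads, so it never reads past the 8 source bytes; whether a given compiler folds those into vpmovzxbd xmm, [mem] memory operands is not guaranteed.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: converts 8 bytes at src to 8 floats at dst using
   only AVX1 (plus the SSE4.1 pmovzxbd that AVX implies). */
__attribute__((target("avx")))    /* assumption: GCC/clang; or build with -mavx */
void bytes8_to_floats_avx1(const uint8_t *src, float *dst)
{
    int lo4, hi4;
    memcpy(&lo4, src,     4);     /* bytes 0..3 */
    memcpy(&hi4, src + 4, 4);     /* bytes 4..7 */

    /* vpmovzxbd xmm: 4 bytes -> 4 dwords each. */
    __m128i lo = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(lo4));
    __m128i hi = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(hi4));

    /* vinsertf128: put the second half into the high 128 bits. */
    __m256i both = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);

    /* vcvtdq2ps ymm is an AVX1 instruction, so no AVX2 is needed here. */
    _mm256_storeu_ps(dst, _mm256_cvtepi32_ps(both));
}
```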

Also note that if you’re looking in the Intel insn ref manual, there are two separate entries for movq:

movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That’s where you’ll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128. (Some compilers don’t define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode is F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64.

The asm ISA ref manual only lists the __m128i _mm_mov_epi64(__m128i a) intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr), which has a stupid prototype (pointer to a 16-byte __m128i type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i* is just passed to this opaque intrinsic, not actually dereferenced.

The more sane _mm_loadu_si64 (void const* mem_addr) is also listed, but gcc is missing that one.
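If neither load intrinsic is available or trusted, memcpy into a 64-bit integer followed by _mm_cvtsi64_si128 is another way to say "load exactly 8 bytes". A minimal sketch, assuming x86-64 (some compilers don't define that intrinsic in 32-bit mode) and GCC or clang for the target attribute; the name bytes8_to_floats and the float-array output are only for illustration. Whether this folds into a vpmovzxbd memory operand depends on the compiler and version, so look at the generated asm.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: converts exactly 8 bytes at src to 8 floats at dst. */
__attribute__((target("avx2")))   /* assumption: GCC/clang; or build with -mavx2 */
void bytes8_to_floats(const uint8_t *src, float *dst)
{
    long long q;
    memcpy(&q, src, sizeof q);         /* 8-byte load: no alignment or aliasing trouble */

    __m128i v = _mm_cvtsi64_si128(q);  /* low 64 bits = q, high 64 bits zeroed */

    __m256i i32 = _mm256_cvtepu8_epi32(v);   /* vpmovzxbd: 8 bytes -> 8 dwords */
    _mm256_storeu_ps(dst, _mm256_cvtepi32_ps(i32));
}
```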
