Do 128-bit cross-lane operations in AVX512 give better performance?

Generally yes: in-lane shuffles are still lower latency on SKX (1 cycle vs. 3), but it’s usually not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w and a couple of other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles.

Shuffle throughput very easily becomes a bottleneck on recent Intel CPUs if you aren’t careful (only one shuffle execution unit, on port 5). Sometimes it’s even worth using two overlapping loads instead of loading once and shuffling, i.e. using an unaligned load as a shuffle, because L1D cache is fast, and so is load-port handling of unaligned loads. (Less so with AVX512, though, since every misaligned 512b load is a cache-line split: vectors and cache lines are both 64 bytes.)
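
For example, a sketch of that idea (the function name is mine, and it assumes p[0..8] are all readable): adding each element to its neighbour with two overlapping loads instead of a load + shuffle:

    #include <immintrin.h>

    // out[i] = p[i] + p[i+1]: the second, overlapping unaligned load does the
    // job of a shuffle, using a load port instead of the port-5 shuffle unit.
    void add_adjacent8(const float *p, float *out) {
        __m256 a = _mm256_loadu_ps(p);      // elements 0..7
        __m256 b = _mm256_loadu_ps(p + 1);  // elements 1..8
        _mm256_storeu_ps(out, _mm256_add_ps(a, b));
    }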

For 256-bit loads, one trick is to choose a load address that splits the data you care about into the two lanes, so you can use a vpshufb (_mm256_shuffle_epi8) in-lane byte shuffle to get each byte where it’s needed.
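
A hypothetical sketch of that trick (the function name and the p-8 offset are my invention; it assumes the 8 bytes before p are readable): duplicating each of 16 bytes using only an in-lane shuffle:

    #include <immintrin.h>

    // Duplicate each of the 16 bytes at p, producing b0 b0 b1 b1 ... b15 b15.
    // Loading from p-8 puts p[0..7] in the low 128b lane and p[8..15] in the
    // high lane, so an in-lane vpshufb can reach every byte it needs.
    __m256i dup_bytes(const char *p) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p - 8));
        const __m256i ctrl = _mm256_setr_epi8(
            8,8, 9,9, 10,10, 11,11, 12,12, 13,13, 14,14, 15,15,  // low lane: p[0..7]
            0,0, 1,1, 2,2, 3,3, 4,4, 5,5, 6,6, 7,7);             // high lane: p[8..15]
        return _mm256_shuffle_epi8(v, ctrl);                     // in-lane byte shuffle
    }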

There are also rotate (new in AVX512) and shift instructions (not new). The 64-bit element-size versions can move data between smaller elements if you use a shift or rotate count of 32 or 16, for example. vprolq zmm, zmm, 32 is 1c latency and runs on port 0 (and also port 1 for the xmm/ymm versions), swapping every element with its neighbour. Shifts/rotates don’t compete for port 5 on SKX.
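
As a sketch (function name mine), swapping the two dwords in every qword of a ZMM without touching port 5:

    #include <immintrin.h>

    // vprolq zmm, zmm, 32: rotate each 64-bit element by 32, i.e. swap
    // adjacent 32-bit elements. Runs on port 0, not the port-5 shuffle unit.
    __m512i swap_dword_pairs(__m512i v) {
        return _mm512_rol_epi64(v, 32);
    }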


For a horizontal sum, the only real choice is what order to shuffle in. Usually start with extract / add down to 128b, then use __m128 shuffles (or integer shifts), instead of using vpermd/q for every shuffle. Or if you want the result broadcast to all elements, use in-lane shuffles between the first few adds, and then shuffle in 128b then 256b chunks with lane-crossing shuffles. (Shuffling in 128b chunks isn’t faster than smaller granularity immediate-control shuffles like vpermq z,z,imm8 on SKX, but that’s all you need for an hsum after doing the in-lane stuff with vshufps or vpermilps.)
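
A minimal sketch of that pattern for a float hsum (assuming AVX512F + AVX512DQ for the 256b extract; the function name is mine): narrow to 256b, then 128b, then finish with cheap in-lane shuffles:

    #include <immintrin.h>

    static inline float hsum512_ps(__m512 v) {
        __m256 lo8 = _mm512_castps512_ps256(v);      // low 256b: free
        __m256 hi8 = _mm512_extractf32x8_ps(v, 1);   // AVX512DQ
        __m256 s8  = _mm256_add_ps(lo8, hi8);
        __m128 lo4 = _mm256_castps256_ps128(s8);
        __m128 hi4 = _mm256_extractf128_ps(s8, 1);
        __m128 s4  = _mm_add_ps(lo4, hi4);
        __m128 odd = _mm_movehdup_ps(s4);            // in-lane: dup odd elements
        __m128 s2  = _mm_add_ps(s4, odd);
        __m128 hi2 = _mm_movehl_ps(odd, s2);         // high half down to low half
        return _mm_cvtss_f32(_mm_add_ss(s2, hi2));
    }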


Note that some future CPUs may split 512b ops into two 256b ops, perhaps an E-core in some future Intel CPU. If done like Zen 1, that could make 512-bit lane-crossing shuffles significantly more expensive: even vperm2f128 on Zen 1 is 8 uops, 3c latency / 3c throughput, vs. 1 uop on SKL. In-lane shuffles obviously decompose into 1 uop per lane fairly easily, but lane-crossing shuffles don’t.

Fortunately Zen 4 does not have this downside: 512-bit instructions are still single-uop (including ZMM shuffles like vpermt2d); they just occupy an execution port for longer. That let AMD make a few execution units full-width, notably the shuffle unit, so things like vpermb are as fast as on Intel, or faster in the case of vpermt2b, which Intel runs as 3 uops. This implementation strategy is much better than Zen 1’s: it generally avoids performance potholes and saves front-end bandwidth for the same amount of back-end work, making 512-bit vectors more usable for general-purpose code on Zen 4 than they are on Intel. See https://uops.info/ and Mysticial’s writeup.


Xeon Phi (discontinued)

On KNL, it’s not lanes, it’s 1-source vs. 2-source shuffles that matter.
e.g. vshufps dst, same, same, imm8 is half the throughput of vpermilps dst, src, imm8.
1-source shuffles with a vector control like vpermd v,v,v are still fast, though (1 source + 1 shuffle-control vector).

Even when they’re only 1 uop, the 4-7c-latency (2-input) shuffles have only one per 2c throughput. I guess that means KNL’s shuffle unit isn’t fully pipelined.


Raw data

https://uops.info/ is the go-to for uop / latency / port microbenchmark info these days: generally well-crafted microbenchmarks and detailed results that don’t try to boil things down to a single number when there are multiple uops, or different latencies from different inputs to the output(s). And no manual typos like there sometimes are in Agner Fog’s otherwise-good instruction tables. Agner’s microarch guide is essential reading for understanding the numbers, and other possible bottlenecks like the front-end.

When this answer was first written, https://uops.info/ didn’t exist, and Agner Fog didn’t yet have test results for Skylake-X (SKX), aka SKL-SP or gcc -march=skylake-avx512. But there were already InstLatx64 (instruction throughput/latency) results, and IACA support. InstLatx64 has a spreadsheet (ODS OpenOffice/LibreOffice format) combining data from IACA (just uop count and ports), from numbers Intel published in a PDF (throughput/latency), and from experimental testing on real hardware (throughput/latency). These days https://uops.info/ is pretty quick to get new microarchitectures tested, but InstLat sometimes has CPUID dumps before test results.

Agner Fog’s instruction tables have data for Knight’s Landing Xeon Phi (KNL), and there’s a section about its Silvermont-based microarchitecture in his microarch PDF.

KNL instructions have better latency if their input comes from the same execution unit (e.g. shuffle -> shuffle) than from a different one (e.g. FMA -> shuffle); see the note at the top of Agner’s spreadsheet. This is what the 4-7c latency numbers are about: a transpose or anything else doing a chain of shuffles might see mostly the lower latency number. (But KNL has generally high latencies, which is why it has 4-way hyperthreading to try to hide them.)


SKX: Skylake-AVX512 (and probably future mainstream Intel CPUs)

Lane-crossing shuffles are normally 1 uop, 3c latency, 1c throughput (see the exceptions below), and even complex/powerful ones like 2-input vpermt2ps are that fast. This includes all shuffles that shuffle whole 128b lanes, or insert/extract 256b chunks.

All in-lane-only shuffles are 1c latency (except for the xmm versions of some new-in-AVX512 lane-crossing shuffles). So use vpshufd zmm, zmm, imm8 or vpunpcklqdq zmm, zmm, zmm when that’s all you need, or vpshufb or vpermilps with a vector control input.
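
For instance (a sketch; the function name is mine), interleaving the low qwords of corresponding lanes needs only an in-lane shuffle:

    #include <immintrin.h>

    // vpunpcklqdq zmm, zmm, zmm: in-lane, 1c latency. In each 128b lane the
    // result is { a_lane.qword0, b_lane.qword0 }.
    __m512i interleave_lo_qwords(__m512i a, __m512i b) {
        return _mm512_unpacklo_epi64(a, b);
    }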

Like Haswell and SKL (non-avx512), SKX can only run shuffle uops on port 5. Again like those earlier CPUs, it can broadcast-load using only the load ports, so that’s just as cheap as a regular vector load. AVX512 broadcast loads can micro-fuse, making memory-source broadcasts cheaper (in shuffle throughput terms) than register source.
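
For example (a sketch; whether the broadcast folds into a memory operand depends on the compiler): multiplying by a scalar loaded from memory shouldn’t cost a shuffle uop:

    #include <immintrin.h>

    // Should compile to vmulps zmm, zmm, dword [mem]{1to16}: the broadcast
    // micro-fuses with the multiply, using a load port but no port-5 uop.
    __m512 scale(__m512 v, const float *p) {
        return _mm512_mul_ps(v, _mm512_set1_ps(*p));
    }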

Even vmovsldup ymm, [mem] / vmovshdup ymm, [mem] use just a load uop for the 256b shuffle. IDK about 512b; Instlat didn’t test memory-source movsl/hdup, so we only have Agner Fog’s data. (And IIRC I confirmed that on my own SKL).

Note that when running 512b instructions, the vector ALUs on port 1 are shut down, so you have a max throughput of 2 vector ALU uops per clock. (But p1 can still run scalar-integer stuff.) And vector load/store uops don’t need p0 / p5, so you can still bottleneck on the front-end (4 uops per clock issue/rename) in code with a mix of non-fused loads, stores, and ALU work (plus integer loop overhead, and vmovdqa register copying that’s handled in the rename stage with no unfused-domain uop, but still costs a front-end slot).

Exceptions to the rule on SKX:

  • VPMOVWB ymm, zmm and similar truncate or signed/unsigned saturate instructions are 2 uops, 4c latency. (Or 2c for the xmm versions). vpmovqd is 1 uop, 3c (or 1c xmm) latency, because its smallest granularity is dword and it’s only truncating, not saturating, so it can be implemented internally with the same hardware that’s needed for pshufb for example. vpmovz/sx instructions are still only 1 uop.

  • vpcompressd/q (left-pack based on a mask) is 2 uops (p5), 3c latency. (Or 6c according to what Intel publishes; maybe Instlat is testing the vector->vector latency and Intel is giving the k register -> vector latency? Unlikely that it’s data-dependent and faster with a trivial mask.) vpexpandd is also 2 uops.

  • AVX512BW vpermt2w / vpermi2w is 3 uops (p0 + 2p5), 7c latency for all 3 operand sizes (xmm/ymm/zmm). Small-granularity wide shuffles are expensive in hardware (see Where is VPERMB in AVX2?, including the comments). This is a 2-source 16-bit-element shuffle with the control in a 3rd vector. It might eventually get faster in future generations, the way pshufb (and all full-register shuffles with granularity smaller than 8 bytes) was slow in first-gen Core 2 (Conroe/Merom) but got fast in the next generation, the Penryn die shrink.

  • AVX512BW vpermw (a one-source word shuffle) is 2p5, 6c latency, 2c throughput, because it’s lane-crossing with word granularity.

  • Expect AVX512VBMI vpermt2b to be as bad or worse on Cannonlake, even if Cannonlake does improve vpermt2w / vpermw.

  • vpermt2d/q/ps/pd are all efficient on SKX because their granularity is dword (32-bit) or wider. (But still apparently 3c latency for the xmm version, so they didn’t build separate hardware to speed up the one-lane case.) These are even more powerful than a lane-crossing shufps: a variable control, with no limitation on which source register each element comes from. It’s a fully general 2-source shuffle where you index into the concatenation of 2 registers, overwriting either the index vector (vpermi2*) or one of the tables (vpermt2*). There’s only one intrinsic because the compiler handles register allocation and copying to preserve still-needed values; see the sketch after this list.
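
As a sketch of how general that is (the function name and index pattern are my own): interleaving the even-indexed dwords of two vectors in a single vpermt2d:

    #include <immintrin.h>

    // Indices 0..15 pick from a, 16..31 pick from b (indexing the
    // concatenation of the two registers). One shuffle produces
    // a0,b0, a2,b2, ... a14,b14.
    __m512i interleave_even_dwords(__m512i a, __m512i b) {
        const __m512i idx = _mm512_setr_epi32( 0, 16,  2, 18,  4, 20,  6, 22,
                                               8, 24, 10, 26, 12, 28, 14, 30);
        return _mm512_permutex2var_epi32(a, idx, b);   // compiles to vpermt2d
    }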


Knight’s Landing:

Shuffles run only on the FP0 port, but front-end throughput is only 2 uops per clock. So a larger fraction of your total instructions can be shuffles without bottlenecking on the shuffle port (vs. SKX), unless they’re half-throughput (2-input) shuffles.

In general, 2-input shuffles like vperm2f128 / vshuff32x4 or vshufps are 2c throughput / 4-7c latency, while 1-input shuffles like vpermd are 1c throughput / 3-6c latency. (i.e. a second input occupies the shuffle unit for an extra cycle (half throughput) and costs 1 extra cycle of latency.) Agner isn’t clear on exactly what the effect of the not-fully-pipelined shuffles is, but I assume they just tie up the shuffle unit, not everything on port FP0 (like the FMA unit).

  • Lane-crossing or not makes no difference on KNL, e.g. vpermilps and vpermps are both fast (1c throughput, 3-6c latency), while vpermi2ps and vshufps are both slow (2c throughput, 4-7c latency). I don’t see any exceptions to that for instructions where KNL supports an AVX512 version (i.e. not counting AVX2-only stuff like vpshufb; that covers pretty much anything with 32-bit or larger granularity).

  • vinserti32x4 and so on (insert/extract with granularity of at least 128b) is a 2-input shuffle for insert, but is fast: 3-6c lat / 1c tput. But extract-to-memory is multiple uops and causes a decode bottleneck: e.g. VEXTRACTF32X4 m128,z is 4 uops, one per 8c throughput. (mostly because of decode).

  • vcompressps/pd, vpcompressd/q, and v[p]expandd/q/ps/pd are 1 uop, 3-6c latency (vs. 2 uops on SKX). But throughput is only one per 3c: Agner doesn’t indicate whether this ties up the whole shuffle unit, or if only this part of it is not fully pipelined.

  • AVX2 byte/word shuffles are very slow for 256b operand-size: pshufb xmm is 5 uops / 10c throughput, vpshufb ymm is 12 uops / 12c throughput. (MMX pshufb mm is 1 uop, 2-6c latency, 1c throughput, so I guess the byte-granularity shuffle unit is 64b wide.)

    pshuflw xmm is 1 uop fast, but vpshuflw ymm is 4 uops, 8c throughput.

    Video encoding on KNL might be barely worth it with 128-bit AVX (vpsadbw xmm is fast), but AVX2 ymm instructions are generally slower than using more 1 uop xmm instructions.

  • movss/sd xmm,xmm is a blend, not a shuffle, and has 0.5c throughput / 2c latency.

  • vpunpcklbw / wd are super slow (except the xmm version), but DQ and QDQ are the regular speed even for ymm / zmm operand size. (2c throughput / 4-7c latency, because it’s a 2-input shuffle).

  • vpmovzx is 3c latency (not 3-6c?) and 2c throughput even for vpmovzxbw. vpmovsx is slower: 2 uops and thus a decode bottleneck, making it 8c latency and 7c throughput. The narrowing truncate instructions (vpmovqb and so on) are 1 uop, 3c lat / 1c tput, but the narrowing saturate instructions are 2 uops and thus slow. Agner didn’t test them with a memory destination.
