Difference between the AVX instructions vxorpd and vpxor

Combining some comments into an answer:

Other than performance, they have identical behaviour. I think that holds even with a memory operand: VEX-encoded instructions have no alignment requirement, apart from the explicitly aligned loads and stores such as vmovaps / vmovdqa.
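To make the "identical behaviour" claim concrete, here is a minimal sketch in C using the 128-bit SSE2 intrinsics, which build on any x86-64 compiler without extra flags. The function names are made up for illustration. Note that a compiler is free to pick either instruction for either intrinsic (and emits the VEX forms when built with AVX enabled), so this checks the result bits, not the instruction choice.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_xor_pd, _mm_xor_si128, cast intrinsics */
#include <string.h>

/* Hypothetical helper: returns 1 if XORing a and b as packed doubles and
   as packed integers gives bit-identical 128-bit results, 0 otherwise. */
int xor_forms_agree(const double a[2], const double b[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);

    /* "FP" boolean: xorpd / vxorpd */
    __m128d fp_result = _mm_xor_pd(va, vb);

    /* "integer" boolean on the same bits: pxor / vpxor.
       The casts only reinterpret the register; they generate no code. */
    __m128i int_result = _mm_xor_si128(_mm_castpd_si128(va),
                                       _mm_castpd_si128(vb));

    double fp_out[2];
    unsigned char int_out[16];
    _mm_storeu_pd(fp_out, fp_result);
    _mm_storeu_si128((__m128i *)int_out, int_result);

    return memcmp(fp_out, int_out, sizeof int_out) == 0;
}

/* Convenience wrapper: aborts via assert if the two forms ever disagree. */
void check_xor_forms_agree(const double a[2], const double b[2])
{
    assert(xor_forms_agree(a, b));
}
```

Calling xor_forms_agree with any pair of inputs, including -0.0, NaN or denormals, should return 1, because neither instruction interprets the bits as numbers.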

On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.

Some CPUs have a “bypass delay” between integer and FP “domains”. Agner Fog’s microarch docs say that on SnB / IvB, the bypass delay is sometimes zero, e.g. when using the “wrong” type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.

On Skylake, FP booleans can run on any port, but the bypass delay depends on which port they happened to run on. (See Intel’s optimization manual for a table.) A boolean that runs on port 5 has no bypass delay between FP math ops, but one that runs on port 0 or port 1 does. Since the FMA units are on ports 0 and 1, the uop issue stage will usually assign booleans to port 5 in FP-heavy code, because it can see that lots of uops are queued up for p0/p1 while p5 is less busy. (See also the Q&A “How are x86 uops scheduled, exactly?”)

I’d recommend not worrying about this. Tune for Haswell, and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data, and Skylake will do fine (but Haswell might not).

On AMD Bulldozer / Piledriver / Steamroller there is no “FP” version of the boolean ops. (See pg. 182 of Agner Fog’s microarch manual.) There’s a delay for forwarding data between execution units: 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0), and 8 cycles for ivec->int. (8, 10 on Bulldozer; 4, 5 on Steamroller for movd/pinsrw/pextrw.) So anyway, you can’t avoid the bypass delay on AMD by using the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD in the non-VEX encoding; the VEX versions all take 4 bytes.
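The code-size point can be checked by writing out the machine code. The table below is a sketch, hand-assembled by me for the register-only form with xmm0 for every operand (ModRM byte 0xC0); the struct and function names are hypothetical. The 4-byte VEX length assumes the 2-byte 0xC5 prefix is usable, which is the case for these operands.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per instruction; every operand is xmm0 (ModRM byte 0xC0). */
struct encoding {
    const char *mnemonic;
    unsigned char bytes[4];
    size_t length;          /* number of valid bytes in `bytes` */
};

static const struct encoding ENCODINGS[] = {
    /* legacy SSE encodings: xorps has no mandatory 0x66 prefix */
    {"xorps",  {0x0F, 0x57, 0xC0},       3},
    {"xorpd",  {0x66, 0x0F, 0x57, 0xC0}, 4},
    {"pxor",   {0x66, 0x0F, 0xEF, 0xC0}, 4},
    /* VEX encodings with the 2-byte 0xC5 prefix */
    {"vxorps", {0xC5, 0xF8, 0x57, 0xC0}, 4},
    {"vxorpd", {0xC5, 0xF9, 0x57, 0xC0}, 4},
    {"vpxor",  {0xC5, 0xF9, 0xEF, 0xC0}, 4},
};

/* Returns the encoded length in bytes, or 0 for an unknown mnemonic. */
size_t encoded_length(const char *mnemonic)
{
    assert(mnemonic != NULL);
    for (size_t i = 0; i < sizeof ENCODINGS / sizeof ENCODINGS[0]; i++) {
        if (strcmp(ENCODINGS[i].mnemonic, mnemonic) == 0)
            return ENCODINGS[i].length;
    }
    return 0;
}
```

The legacy xorps row is the only 3-byte entry, because the prefix that distinguishes the other legacy forms costs a byte, while in the VEX forms that distinction is folded into the prefix itself.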

In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren’t part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order execution), then PXOR may be the way to go.
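As a rough illustration of keeping multiple dependency chains in flight, here is a sketch of an XOR reduction with two independent accumulators, so that the latency of one chain can overlap with the other. The function name and the multiple-of-8 length requirement are assumptions made to keep the example short, and whether this is actually faster on a given CPU is something to measure, not assume.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_xor_si128, _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: XOR of all n elements of data.
   Assumes n is a multiple of 8 (two 4-element vectors per iteration). */
uint32_t xor_reduce(const uint32_t *data, size_t n)
{
    assert(n % 8 == 0);

    /* Two accumulators form two independent dependency chains. */
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(data + i + 4));
        acc0 = _mm_xor_si128(acc0, v0);  /* chain 0: pxor / vpxor */
        acc1 = _mm_xor_si128(acc1, v1);  /* chain 1: does not depend on chain 0 */
    }

    /* Combine the chains once, outside the loop, then fold the 4 lanes. */
    __m128i acc = _mm_xor_si128(acc0, acc1);
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3];
}
```

With a single accumulator, every XOR would have to wait for the previous one; with two, the out-of-order core can work on both chains at once, which is why any extra latency on one chain matters less.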

On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.
