Do 128-bit cross-lane operations in AVX-512 give better performance?
Generally yes: in-lane shuffles still have lower latency on SKX (1 cycle vs. 3), but it’s usually not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w and a couple of other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles. Shuffle throughput very easily becomes …
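To make the in-lane vs. lane-crossing distinction concrete, here is a minimal scalar sketch in C of what the two kinds of shuffle do to a 512-bit vector of sixteen 32-bit elements. It only models the data movement, roughly along the lines of vpermilps (in-lane, variable control) and vpermps (lane-crossing); it says nothing about latency or port usage. The function names `permute_in_lane` and `permute_cross_lane` are hypothetical, chosen for illustration, and are not intrinsics.

```c
#include <stdint.h>

#define VEC_ELEMS  16  /* 512 bits / 32-bit elements */
#define LANE_ELEMS 4   /* 128 bits / 32-bit elements */

/* Hypothetical scalar model of an in-lane shuffle (vpermilps-style):
 * each destination element can only pick a source element from its
 * own 128-bit lane, so only the low 2 bits of the control are used.
 * dst must not alias src. */
void permute_in_lane(uint32_t dst[VEC_ELEMS],
                     const uint32_t src[VEC_ELEMS],
                     const uint32_t ctrl[VEC_ELEMS])
{
    for (int i = 0; i < VEC_ELEMS; i++) {
        int lane_base = i & ~(LANE_ELEMS - 1);          /* first element of this lane */
        dst[i] = src[lane_base + (ctrl[i] & (LANE_ELEMS - 1))];
    }
}

/* Hypothetical scalar model of a lane-crossing shuffle (vpermps-style):
 * each destination element can pick any of the 16 source elements,
 * so the low 4 bits of the index are used.
 * dst must not alias src. */
void permute_cross_lane(uint32_t dst[VEC_ELEMS],
                        const uint32_t src[VEC_ELEMS],
                        const uint32_t idx[VEC_ELEMS])
{
    for (int i = 0; i < VEC_ELEMS; i++) {
        dst[i] = src[idx[i] & (VEC_ELEMS - 1)];
    }
}
```

With the same index vector `15 - i`, the lane-crossing version reverses the whole vector, while the in-lane version only reverses the four elements inside each 128-bit lane. Doing a full reversal with in-lane shuffles would therefore need at least one extra instruction to reorder the lanes, which is the trade-off the answer describes.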