Half-precision floating-point arithmetic on Intel chips

Related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture – has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE 754 binary16 format as the F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16, which just has conversion to/from …
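To make the difference between the two 16-bit formats concrete, here is a minimal sketch in plain Python (no hardware FP16 or BF16 instructions involved). It uses the standard `struct` module's `e` format for IEEE 754 binary16 and builds bfloat16 by rounding the upper half of a float32. The function names are hypothetical and chosen only for this illustration.

```python
import struct

def float_to_binary16_bits(x: float) -> int:
    """Round x to IEEE 754 binary16 and return the raw 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def float_to_bfloat16_bits(x: float) -> int:
    """Round x to float32, then keep its top 16 bits (round-to-nearest-even)."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: truncate and force a mantissa bit so the result stays a NaN.
    if (bits32 & 0x7F800000) == 0x7F800000 and (bits32 & 0x007FFFFF):
        return (bits32 >> 16) | 0x0040
    # Bias of 0x7FFF plus the lowest kept bit gives ties-to-even rounding.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF

def split_fields(bits16: int, exp_bits: int):
    """Split a 16-bit pattern into (sign, biased exponent, mantissa)."""
    man_bits = 15 - exp_bits
    sign = bits16 >> 15
    exponent = (bits16 >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits16 & ((1 << man_bits) - 1)
    return sign, exponent, mantissa

if __name__ == "__main__":
    half = float_to_binary16_bits(1.0)   # 5 exponent bits, 10 mantissa bits
    bf16 = float_to_bfloat16_bits(1.0)   # 8 exponent bits, 7 mantissa bits
    print(hex(half), split_fields(half, exp_bits=5))
    print(hex(bf16), split_fields(bf16, exp_bits=8))
```

binary16 spends its bits on mantissa precision (10 bits) while bfloat16 keeps float32's 8-bit exponent range, which is why the two are not interchangeable even though both are 16 bits wide.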

Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

It looks like there is a nice workaround for this implemented in recent versions of glibc: a “tunables” feature that guides selection of optimized string functions. You can find a general overview of this feature here and the relevant code inside glibc in ifunc-impl-list.c. Here’s how I figured it out. First, I took the address …
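As a rough sketch of how the tunables mechanism is driven, the snippet below builds a `GLIBC_TUNABLES` value that masks out CPU features and merges it into an environment for a child process. The helper names are hypothetical. The `glibc.cpu` namespace and the feature names in the example are assumptions that depend on the glibc version (older releases used a different namespace and different feature spellings), so check them against the installed glibc before relying on them.

```python
import os

def glibc_hwcaps_tunable(disabled_features, namespace="glibc.cpu"):
    """Build a hwcaps tunable string that disables the given CPU features.

    Each feature is prefixed with '-' to mask it out. The namespace and the
    feature names are assumptions; both vary between glibc versions.
    """
    mask = ",".join("-" + name for name in disabled_features)
    return f"{namespace}.hwcaps={mask}"

def env_with_tunable(tunable, base_env=None):
    """Return a copy of the environment with the tunable added.

    Multiple tunables in GLIBC_TUNABLES are separated by ':'.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("GLIBC_TUNABLES")
    env["GLIBC_TUNABLES"] = f"{existing}:{tunable}" if existing else tunable
    return env

if __name__ == "__main__":
    # Example feature names only; they may not match your glibc version.
    tunable = glibc_hwcaps_tunable(["AVX2_Usable", "AVX_Fast_Unaligned_Load"])
    env = env_with_tunable(tunable)
    print("GLIBC_TUNABLES=" + env["GLIBC_TUNABLES"])
    # The env dict could then be passed to a child process, e.g.
    # subprocess.run(["valgrind", "./your_program"], env=env)
```

Setting the variable per process this way keeps the change local to the tool run (valgrind, gdb record) instead of affecting the whole system.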

Mathematical functions for SIMD registers

Libmvec is an x86_64 glibc library with SSE4, AVX, AVX2, and AVX-512 vectorized functions for cos, exp, log, sin, pow, and sincos, in single precision and double precision. The accuracy of these functions is 4-ulp maximum relative error. Usually gcc inserts calls to Libmvec functions, such as _ZGVdN4v_cos, while it is auto-vectorizing scalar code, for …
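The symbol name _ZGVdN4v_cos follows the x86_64 vector function ABI naming scheme: a `_ZGV` prefix, an ISA letter, `N` for an unmasked variant, the number of lanes, one `v` per vector argument, then the scalar function name. The sketch below reconstructs such names. The helper `libmvec_symbol` is hypothetical, and it only covers functions whose arguments are all plain vectors (so not the pointer-taking sincos variants).

```python
# ISA letter and register width in bits, per the vector function ABI.
ISA_TABLE = {
    "sse4": ("b", 128),
    "avx": ("c", 256),
    "avx2": ("d", 256),
    "avx512": ("e", 512),
}

def libmvec_symbol(func, isa, n_args=1, double=True):
    """Compose the mangled name of an unmasked vector math function.

    func    scalar function name without the 'f' suffix, e.g. "cos"
    isa     one of the keys in ISA_TABLE
    n_args  number of vector arguments (2 for pow)
    double  True for double precision, False for single precision
    """
    letter, width = ISA_TABLE[isa]
    lanes = width // (64 if double else 32)
    name = func if double else func + "f"
    return f"_ZGV{letter}N{lanes}{'v' * n_args}_{name}"

if __name__ == "__main__":
    print(libmvec_symbol("cos", "avx2"))               # _ZGVdN4v_cos
    print(libmvec_symbol("pow", "avx512", n_args=2))   # _ZGVeN8vv_pow
    print(libmvec_symbol("sin", "sse4", double=False)) # _ZGVbN4v_sinf
```

Reading the name back: `d` selects AVX2, `N4` means four unmasked lanes, which matches four doubles in a 256-bit register.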

Do 128bit cross lane operations in AVX512 give better performance?

Generally yes, in-lane is still lower latency on SKX (1 cycle vs. 3), but usually it’s not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w and a couple of other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles. Shuffle throughput very easily becomes …

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

What you want to do is compile different object files for each instruction set you are targeting. Then create a CPU dispatcher which asks CPUID for the available instruction set and then jumps to the appropriate version of the function. I already described this in several different questions and answers: disable-avx2-functions-on-non-haswell-processors, do-i-need-to-make-multiple-executables-for-targetting-different-instruction-set, how-to-check-with-intel-intrinsics-if-avx-extensions-is-supported-by-the-cpu, cpu-dispatcher-for-visual-studio-for-avx-and-sse, create-separate-object-files-from-the-same-source-code-and-link-to-an-executable
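The dispatch pattern itself can be sketched independently of the compiler. The Python below is only a stand-in: Python cannot issue CPUID directly, so it reads feature flags from Linux /proc/cpuinfo-style text instead, and the three `dot_*` functions are hypothetical placeholders for what would be the same C function compiled into separate object files with different -m flags.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def dot_baseline(a, b):
    """Portable version: needs no special CPU features."""
    return sum(x * y for x, y in zip(a, b))

def dot_avx(a, b):
    """Placeholder for the version built with AVX enabled."""
    return dot_baseline(a, b)

def dot_avx2_fma(a, b):
    """Placeholder for the version built with AVX2 and FMA enabled."""
    return dot_baseline(a, b)

# Ordered from most to least demanding; the last entry always matches.
IMPLEMENTATIONS = [
    ({"avx2", "fma"}, dot_avx2_fma),
    ({"avx"}, dot_avx),
    (set(), dot_baseline),
]

def select_implementation(cpu_flags, table=IMPLEMENTATIONS):
    """Pick the first implementation whose required features are all present."""
    for required, func in table:
        if required <= cpu_flags:
            return func
    raise RuntimeError("no usable implementation")

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:  # Linux only
            flags = parse_cpu_flags(f.read())
    except OSError:
        flags = set()  # unknown CPU: fall back to the baseline
    dot = select_implementation(flags)
    print(dot.__name__, dot([1.0, 2.0], [3.0, 4.0]))
```

The selection runs once and the chosen function is reused afterwards, so callers never execute an instruction the CPU does not support and the feature check is not repeated on every call.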