avx - w3toppers.com

Half-precision floating-point arithmetic on Intel chips

related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture – has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info. Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE754 binary16 format as F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16 which just has conversion to/from …

Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

It looks like there is a nice workaround for this implemented in recent versions of glibc: a “tunables” feature that guides selection of optimized string functions. You can find a general overview of this feature here and the relevant code inside glibc in ifunc-impl-list.c. Here’s how I figured it out. First, I took the address …

Mathematical functions for SIMD registers

Libmvec is an x86_64 glibc library with SSE4, AVX, AVX2, and AVX-512 vectorized functions for cos, exp, log, sin, pow, and sincos, in single precision and double precision. The accuracy of these functions is 4-ulp maximum relative error. Usually gcc inserts calls to Libmvec functions, such as _ZGVdN4v_cos, while it is auto-vectorizing scalar code, for …

Do 128bit cross lane operations in AVX512 give better performance?

Generally yes, in-lane is still lower latency on SKX (1 cycle vs. 3), but usually it’s not worth spending extra instructions to use them instead of the powerful lane-crossing shuffles. However, vpermt2w and a couple other shuffles need multiple shuffle-port uops, so they cost as much as multiple simpler shuffles. Shuffle throughput very easily becomes …

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

What you want to do is compile different object files for each instruction set you are targeting. Then create a CPU dispatcher which asks CPUID for the available instruction set and then jumps to the appropriate version of the function. I already described this in several different questions and answers: disable-avx2-functions-on-non-haswell-processors, do-i-need-to-make-multiple-executables-for-targetting-different-instruction-set, how-to-check-with-intel-intrinsics-if-avx-extensions-is-supported-by-the-cpu, cpu-dispatcher-for-visual-studio-for-avx-and-sse, create-separate-object-files-from-the-same-source-code-and-link-to-an-executable
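The dispatcher idea can be sketched in a single file. This is only an illustration under stated assumptions: it uses the GCC/clang `target("avx2")` function attribute as a stand-in for separately compiled object files, and the builtin `__builtin_cpu_supports` in place of a hand-written CPUID query. The names `sum_baseline`, `sum_avx2`, `resolve_sum`, and `sum_dispatch` are hypothetical and do not come from the linked answers.

```c
#include <stddef.h>

/* Hypothetical kernel: sum an array of floats. In the approach described
   above, each variant would sit in its own object file built with
   different flags; here a target attribute stands in for that. */
static float sum_baseline(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
/* Same source, but the compiler is allowed to use AVX2 inside this
   function only. */
__attribute__((target("avx2")))
static float sum_avx2(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}
#endif

typedef float (*sum_fn)(const float *, size_t);

/* Ask the CPU once which variant is safe to call. */
static sum_fn resolve_sum(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

/* Public entry point: resolves on first use, then jumps through the
   cached function pointer. */
float sum_dispatch(const float *x, size_t n) {
    static sum_fn impl = NULL;
    if (impl == NULL) impl = resolve_sum();
    return impl(x, n);
}
```

Callers only ever see `sum_dispatch`, so the rest of the program can be compiled without `-mavx` or `-mfma` and still run on older CPUs.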

Difference between the AVX instructions vxorpd and vpxor

Combining some comments into an answer: Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions). On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5. Some CPUs have a …

How to use AVX/pclmulqdq on Mac OS X

A simpler solution that fixed this problem for me was adding -Wa,-q to the compiler flags. From the man pages for as (version 1.38): -q Use the clang(1) integrated assembler instead of the GNU based system assembler. The -Wa part passes it from the compiler driver to the assembler, much like -Wl passes arguments …

Optimizations for pow() with const non-integer exponent?

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebyshev terms. It’s best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon. #include <stdlib.h> #include <math.h> // Returns x^(5/12) for …

Fastest way to set __m256 value to all ONE bits

See also Set all bits in CPU register to 1 efficiently which covers AVX, AVX2, and AVX512 zmm and k (mask) registers. You obviously didn’t even look at the asm output, which is trivial to do: #include <immintrin.h> __m256i all_ones(void) { return _mm256_set1_epi64x(-1); } compiles to with GCC and clang with any -march that includes …

Find the first instance of a character using simd

You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that’s the optimal way to implement this for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and …
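A minimal sketch of the compare-then-movemask pattern, assuming GCC or clang on x86-64. The function names `find_avx2`, `find_scalar`, and `find_first` are hypothetical, and the runtime check plus scalar fallback are additions for portability rather than part of the original answer. Each 32-byte chunk is compared against the needle, the comparison result is condensed into a 32-bit mask with one movemask over the whole 256-bit vector, and the lowest set bit gives the byte offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain byte-by-byte search, used for tails and as a fallback. */
static ptrdiff_t find_scalar(const char *buf, size_t len, char c) {
    for (size_t i = 0; i < len; i++)
        if (buf[i] == c) return (ptrdiff_t)i;
    return -1;
}

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

__attribute__((target("avx2")))
static ptrdiff_t find_avx2(const char *buf, size_t len, char c) {
    const __m256i needle = _mm256_set1_epi8(c);
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        /* Unaligned load of 32 bytes, then byte-wise equality compare. */
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i eq = _mm256_cmpeq_epi8(chunk, needle);
        /* One bit per byte; bit k is set if byte k matched. */
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq);
        if (mask != 0)
            return (ptrdiff_t)(i + (size_t)__builtin_ctz(mask));
    }
    /* Fewer than 32 bytes remain: finish with the scalar loop. */
    ptrdiff_t tail = find_scalar(buf + i, len - i, c);
    return tail < 0 ? -1 : (ptrdiff_t)i + tail;
}
#endif

/* Returns the index of the first occurrence of c in buf, or -1. */
ptrdiff_t find_first(const char *buf, size_t len, char c) {
#ifdef HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return find_avx2(buf, len, c);
#endif
    return find_scalar(buf, len, c);
}
```

For example, `find_first("hello", 5, 'l')` returns 2, while a search for a byte that is absent returns -1.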