How to check if a CPU supports the SSE3 instruction set?

I’ve created a GitHub repo that will detect CPU and OS support for all the major x86 ISA extensions: https://github.com/Mysticial/FeatureDetector Here’s a shorter version. First you need to access the CPUID instruction:

#ifdef _WIN32
//  Windows
#define cpuid(info, x)    __cpuidex(info, x, 0)
#else
//  GCC Intrinsics
#include <cpuid.h>
void cpuid(int info[4], int InfoType){
    __cpuid_count(InfoType, 0, … Read more
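As a concrete sketch of the idea (not code from the linked repo; the function name cpu_has_sse3 is just illustrative), SSE3 support is reported in CPUID leaf 1, ECX bit 0:

#include <stdbool.h>
#ifdef _WIN32
#include <intrin.h>     // __cpuid
#else
#include <cpuid.h>      // __get_cpuid
#endif

// Returns true if the CPU reports SSE3 (a.k.a. PNI): CPUID leaf 1, ECX bit 0.
static bool cpu_has_sse3(void) {
#ifdef _WIN32
    int info[4];
    __cpuid(info, 1);
    return (info[2] & 1) != 0;
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & 1u) != 0;
#endif
}

A CPUID bit only tells you about the CPU; the linked repo also checks OS support (which matters for AVX and later), and that part is beyond this sketch.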

How to efficiently perform double/int64 conversions with SSE/AVX?

There’s no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned (and also to/from 32-bit unsigned). See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64. If you only have AVX2 or less, you’ll need tricks like the ones below for packed conversion. (For scalar, x86-64 has scalar int64_t <-> double … Read more
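For a flavor of the kind of trick involved when only AVX2 is available, here is a minimal sketch of the widely used “magic number” bias (2^52 + 2^51); it only handles values in the range [-2^51, 2^51] and is not necessarily the exact code given in the full answer:

#include <immintrin.h>

// 0x0018000000000000 == 2^52 + 2^51 (exactly representable as a double).
// Doubles of that magnitude store their integer value directly in the low
// mantissa bits, so integer add/sub on the bit pattern does the conversion.
// Valid only for inputs in the range [-2^51, 2^51].
static inline __m256d int64_to_double(__m256i x) {
    const __m256d k = _mm256_set1_pd(0x0018000000000000);
    x = _mm256_add_epi64(x, _mm256_castpd_si256(k));     // bits of (2^52 + 2^51 + x)
    return _mm256_sub_pd(_mm256_castsi256_pd(x), k);     // subtract the bias as a double
}

static inline __m256i double_to_int64(__m256d x) {
    const __m256d k = _mm256_set1_pd(0x0018000000000000);
    x = _mm256_add_pd(x, k);                              // round to integer and bias
    return _mm256_sub_epi64(_mm256_castpd_si256(x),
                            _mm256_castpd_si256(k));      // recover the integer bits
}

With AVX512DQ (plus AVX512VL for the 256-bit forms) the whole thing collapses to a single _mm256_cvtpd_epi64 / _mm256_cvtepi64_pd.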

How to solve the 32-byte-alignment issue for AVX load/store operations?

Yes, you can use _mm256_loadu_ps / storeu for unaligned loads/stores (see AVX: data alignment: store crash, storeu, load, loadu doesn’t). If the compiler doesn’t do a bad job (cough, GCC default tuning), AVX _mm256_loadu/storeu on data that happens to be aligned is just as fast as alignment-required load/store, so aligning data when convenient still gives you … Read more
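A minimal sketch of the pattern (the helper name copy_floats is just illustrative):

#include <immintrin.h>
#include <stddef.h>

// Copy n floats; src and dst may have any alignment.
// n is assumed to be a multiple of 8 to keep the sketch short.
static void copy_floats(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);   // no 32-byte alignment requirement
        _mm256_storeu_ps(dst + i, v);          // unlike _mm256_store_ps, never faults
    }
}

If you control the allocation, aligned_alloc(32, bytes) (C11) or alignas(32) keeps the buffers 32-byte aligned, so the loadu/storeu forms hit the fast aligned case while still being safe if the guarantee is ever broken.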

Why doesn’t gcc resolve _mm256_loadu_pd as single vmovupd?

GCC’s default tuning (-mtune=generic) includes -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, because that gives a minor speedup on some CPUs (e.g. first-gen Sandybridge, and some AMD CPUs) in some cases when memory is actually misaligned at runtime. Use -O3 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store if you don’t want this, or better, use -mtune=haswell. Or use -march=native to optimize for your own … Read more
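As a quick way to see the effect (the file name example.c and the function name are just illustrative):

#include <immintrin.h>

// gcc -O3 -mavx -c example.c
//     default -mtune=generic: the unaligned load may be split into two 16-byte halves
// gcc -O3 -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -c example.c
//     single vmovupd
// gcc -O3 -march=native -c example.c
//     tunes (and targets) the build machine
__m256d load_vec(const double *p) {
    return _mm256_loadu_pd(p);
}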