sse - w3toppers.com

Difference between MOVDQA and MOVAPS x86 instructions?

In functionality, they are identical. On some (but not all) micro-architectures, there are timing differences due to “domain crossing penalties”. For this reason, one should generally use movdqa when the data is being used with integer SSE instructions, and movaps when the data is being used with floating-point instructions. For more information on this subject, …
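As a rough illustration of matching the load to the data's domain, the sketch below uses SSE2 intrinsics on an x86 target. The function names add_i32x4 and add_f32x4 are hypothetical, and the instruction named in each comment is only what compilers typically emit; a compiler is free to choose a different encoding or fold the load into the arithmetic instruction.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float ones) */
#include <stdint.h>

/* Integer domain: _mm_load_si128 typically compiles to movdqa. */
void add_i32x4(const int32_t *a, const int32_t *b, int32_t *out)
{
    /* Aligned loads and stores require 16-byte aligned addresses. */
    assert((uintptr_t)(const void *)a % 16 == 0);
    assert((uintptr_t)(const void *)b % 16 == 0);
    assert((uintptr_t)(const void *)out % 16 == 0);

    __m128i va = _mm_load_si128((const __m128i *)a);
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Floating-point domain: _mm_load_ps typically compiles to movaps. */
void add_f32x4(const float *a, const float *b, float *out)
{
    assert((uintptr_t)(const void *)a % 16 == 0);
    assert((uintptr_t)(const void *)b % 16 == 0);
    assert((uintptr_t)(const void *)out % 16 == 0);

    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}
```

Inspecting the generated assembly (for example with gcc -O2 -S) is the way to confirm which move instruction was actually selected.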

Getting started with Intel x86 SSE SIMD instructions

First, I don’t recommend using the built-in functions; they are not portable (across compilers for the same architecture). Use intrinsics: GCC does a wonderful job optimizing SSE intrinsics into even more optimized code. You can always have a peek at the assembly and see how to use SSE to its full potential. Intrinsics …

Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first (least-significant element first) comments or variable names, I match that. Given the choice, I prefer MSE-first (most-significant element first) notation in comments, especially when designing something with shuffles, or packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in …

SIMD math libraries for SSE and AVX

I have implemented Vecmathlib https://bitbucket.org/eschnett/vecmathlib/ as a generic library for two other projects (the Einstein Toolkit, and pocl http://pocl.sourceforge.net/). Vecmathlib is open source, and is written in C++.

Get sum of values stored in __m256d with SSE/AVX

It appears that you’re doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all. For a dot-product of an …
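For the cases where a horizontal sum of one __m256d really is needed, one common pattern is to add the high 128-bit half to the low half and then add the two remaining lanes. The sketch below is an illustration, not the code from the original answer. It assumes GCC or Clang (the target attribute enables AVX for this one function) and a CPU that supports AVX; the name hsum256_pd and the pointer-based interface are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Sums the four doubles at p[0..3] using one 256-bit vector.
   Assumption: GCC/Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
double hsum256_pd(const double *p)
{
    assert(p != NULL);

    __m256d v    = _mm256_loadu_pd(p);             /* {p0, p1, p2, p3} */
    __m128d lo   = _mm256_castpd256_pd128(v);      /* {p0, p1} */
    __m128d hi   = _mm256_extractf128_pd(v, 1);    /* {p2, p3} */
    __m128d sum2 = _mm_add_pd(lo, hi);             /* {p0+p2, p1+p3} */
    __m128d high = _mm_unpackhi_pd(sum2, sum2);    /* {p1+p3, p1+p3} */
    return _mm_cvtsd_f64(_mm_add_sd(sum2, high));  /* (p0+p2) + (p1+p3) */
}
```

Note that the additions are reassociated compared with a left-to-right scalar sum, so results can differ in the last bits for inputs that do not add exactly.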

How to determine if memory is aligned?

#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

The cast to void * (or, equivalently, char *) is necessary because the standard only guarantees an invertible conversion to uintptr_t for void *.

If you want type safety, consider using an inline function:

static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count) { …
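The excerpt cuts off before the function body. As a minimal sketch of what a type-safe check along these lines could look like (this is not the original author's code; the name ptr_is_aligned and the assert on byte_count are assumptions added for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when pointer is a multiple of byte_count.
   Taking const void * means any object pointer converts implicitly. */
static inline bool ptr_is_aligned(const void *pointer, size_t byte_count)
{
    assert(byte_count != 0);  /* modulo by zero is undefined behavior */
    return (uintptr_t)pointer % byte_count == 0;
}
```

A typical use would be checking a buffer before an aligned SSE load, for example ptr_is_aligned(buf, 16).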

C++ error: ‘_mm_sin_ps’ was not declared in this scope

_mm_sin_ps is part of the SVML library, shipped with Intel compilers only. GCC developers focused on wrapping machine instructions and simple tasks, so there’s no SVML in immintrin.h so far. You have to use a library or write it yourself. Sine implementations: Taylor series, CORDIC, quadratic curve.
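To give an idea of the "write it yourself" route, here is a small sketch of the Taylor-series option using plain SSE intrinsics. It is not a replacement for SVML's _mm_sin_ps: the name sin_taylor_ps is hypothetical, there is no range reduction, and the truncated series (terms up to x^7) is only reasonable for inputs in roughly [-pi/2, pi/2], where the error stays below about 2e-4.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approximates sin(x) for four floats at once with the series
   x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
   Assumption: |x| <= pi/2; no range reduction is performed. */
static inline __m128 sin_taylor_ps(__m128 x)
{
    const __m128 one = _mm_set1_ps(1.0f);
    const __m128 c3  = _mm_set1_ps(-1.0f / 6.0f);
    const __m128 c5  = _mm_set1_ps(1.0f / 120.0f);
    const __m128 c7  = _mm_set1_ps(-1.0f / 5040.0f);

    __m128 x2 = _mm_mul_ps(x, x);
    /* p = 1 + x2*(c3 + x2*(c5 + x2*c7)) */
    __m128 p = _mm_add_ps(c5, _mm_mul_ps(x2, c7));
    p = _mm_add_ps(c3, _mm_mul_ps(x2, p));
    p = _mm_add_ps(one, _mm_mul_ps(x2, p));
    return _mm_mul_ps(x, p);
}
```

A production implementation would add range reduction and a minimax polynomial rather than raw Taylor coefficients.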

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate). An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two. The IEEE and C standards allow this when …

latency vs throughput in intel intrinsics

For a much more complete picture of CPU performance, see Agner Fog’s microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel’s optimization manual. See also How many CPU cycles are needed for each assembly instruction? and What considerations …