x86 - w3toppers.com

how are barriers/fences and acquire, release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later C++ Q&A “How is release-and-acquire achieved on x86 only using MOV?”), but I’ll give a summary here. Still, good question; it’s useful to collect this all in one place. On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW … Read more
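As a rough illustration of the acquire/release semantics the excerpt refers to, here is a minimal C++ sketch of a release-store paired with an acquire-load. The names `Mailbox`, `publish`, `try_consume` and `mailbox_demo` are hypothetical and not from the original answer. On x86, compilers typically emit plain MOV instructions for both operations, since ordinary x86 loads and stores already have acquire and release ordering.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical example type: a one-slot mailbox handed from a producer
// thread to a consumer thread.
struct Mailbox {
    int payload = 0;                  // plain (non-atomic) data
    std::atomic<bool> ready{false};   // flag that publishes the data
};

// Producer side: write the data first, then release-store the flag.
// The release ordering keeps the payload write from being reordered
// after the flag store.
inline void publish(Mailbox& m, int value) {
    m.payload = value;
    m.ready.store(true, std::memory_order_release);
}

// Consumer side: acquire-load the flag; if it is set, the payload
// written before the matching release-store is guaranteed visible.
inline bool try_consume(Mailbox& m, int& out) {
    if (!m.ready.load(std::memory_order_acquire)) {
        return false;
    }
    out = m.payload;
    return true;
}

// Usage sketch: a single-threaded sanity check of the handoff logic.
inline void mailbox_demo() {
    Mailbox m;
    int v = 0;
    assert(!try_consume(m, v));
    publish(m, 42);
    assert(try_consume(m, v) && v == 42);
}
```

In real use, `publish` and `try_consume` would run on different threads; the single-threaded demo only checks the control flow, not the ordering guarantee itself.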

How to access the control registers cr0,cr2,cr3 from a program? Getting segmentation fault

Quoting from Intel® 64 and IA-32 Architectures Software Developer Manuals 3-650 Vol. 2A on moving to and from control registers: This instruction can be executed only when the current privilege level is 0. Which means the instruction can only be executed in kernel mode. A minimal kernel module, that logs the contents of cr0, cr2 … Read more

Load address calculation when using AVX2 gather instructions

Gather instructions do not have any alignment requirements, so it would be too restrictive not to allow byte addressing. Another reason is consistency. With SIB addressing we obviously have a byte address: MOV eax, [rcx + rdx * 2]. Since VPGATHERDD is just a vectorized variant of this MOV instruction, we should not expect anything different … Read more
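The byte-addressing rule can be sketched with a portable scalar model. This is not the intrinsic itself: `gather_i32_model` is a hypothetical helper that mimics the per-lane address calculation (base + index * scale, in bytes) of an 8-lane dword gather, assuming the indices are signed 32-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model (not an intrinsic) of the per-lane address
// calculation of an 8-lane dword gather: address = base + index * scale,
// where the product is a BYTE offset and scale is 1, 2, 4 or 8.
inline void gather_i32_model(const void* base,
                             const std::int32_t (&idx)[8],
                             int scale,
                             std::int32_t (&out)[8]) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char* bytes = static_cast<const unsigned char*>(base);
    for (int lane = 0; lane < 8; ++lane) {
        // Widen the 32-bit index before scaling, then treat it as a byte offset.
        const std::ptrdiff_t offset =
            static_cast<std::ptrdiff_t>(idx[lane]) * scale;
        // memcpy permits unaligned reads, matching the lack of alignment rules.
        std::memcpy(&out[lane], bytes + offset, sizeof(std::int32_t));
    }
}
```

With scale 4 the indices behave like element indices into an `int32_t` array; with scale 1 the same elements are reached by passing byte offsets that are four times larger.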

SSE multiplication of 4 32-bit integers

If you need signed 32×32 bit integer multiplication then the following example at software.intel.com looks like it should do what you want: static inline __m128i muly(const __m128i &a, const __m128i &b) { __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/ __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */ return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE … Read more
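For checking a SIMD routine like the one quoted above, a scalar reference can be handy. The following is a portable sketch, not the intrinsic version: `mul_lo32_lanes` and `mul_lo32_demo` are hypothetical names, and the assumption is that the desired result is the low 32 bits of each lane's product. Those low bits are the same whether the inputs are interpreted as signed or unsigned.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;

// Hypothetical scalar reference: multiply four 32-bit lanes and keep the
// low 32 bits of each product. The multiply is done in 64 bits to avoid
// any overflow concerns before truncating.
inline Lanes mul_lo32_lanes(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        const std::uint64_t wide = static_cast<std::uint64_t>(a[i]) * b[i];
        r[i] = static_cast<std::uint32_t>(wide);  // truncate to the low 32 bits
    }
    return r;
}

// Usage sketch: 0xFFFFFFFF is -1 as a signed lane, and -1 * 2 == -2,
// whose bit pattern is 0xFFFFFFFE.
inline void mul_lo32_demo() {
    const Lanes r = mul_lo32_lanes({1u, 2u, 3u, 0xFFFFFFFFu}, {5u, 6u, 7u, 2u});
    assert(r[0] == 5u && r[1] == 12u && r[2] == 21u && r[3] == 0xFFFFFFFEu);
}
```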

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do: __m256d x1, x2; // calculate 4 two-element horizontal sums: // lower 64 bits contain x1[0] + x1[1] // next 64 bits contain x2[0] + x2[1] // next 64 bits contain x1[2] + … Read more

Is there hardware support for 128bit integers in modern processors?

The x86-64 instruction set can do 64-bit*64-bit to 128-bit using one instruction (mul for unsigned, imul for signed, each with one operand), so I would argue that to some degree the x86 instruction set does include some support for 128-bit integers. If your instruction set does not have an instruction to do 64-bit*64-bit to … Read more
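A minimal sketch of the 64-bit*64-bit to 128-bit multiply from C++, assuming GCC or Clang on a 64-bit target, where the non-standard `unsigned __int128` extension is available. The names `U128`, `mul_64x64` and `mul_64x64_demo` are hypothetical. On x86-64 these compilers typically lower the widening multiply to a single one-operand `mul`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the two 64-bit halves of a 128-bit product.
struct U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Full 64x64 -> 128-bit unsigned multiply.
// Assumption: the compiler supports the unsigned __int128 extension
// (GCC and Clang do on 64-bit targets; it is not standard C++).
inline U128 mul_64x64(std::uint64_t a, std::uint64_t b) {
    const unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
    return U128{static_cast<std::uint64_t>(p),
                static_cast<std::uint64_t>(p >> 64)};
}

// Usage sketch: (2^64 - 1) * 2 == 2^65 - 2, so hi == 1 and lo == 2^64 - 2.
inline void mul_64x64_demo() {
    const U128 r = mul_64x64(0xFFFFFFFFFFFFFFFFull, 2);
    assert(r.hi == 1 && r.lo == 0xFFFFFFFFFFFFFFFEull);
}
```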

Branch target prediction in conjunction with branch prediction?

Do read along with the Intel optimization manual; the current download location is here. If that link is stale (they move stuff around all the time), search the Intel site for “Architectures optimization manual”. Keep in mind the info there is fairly generic; they disclose only as much as needed to allow writing efficient code. Branch prediction implementation … Read more

What is the difference between Trap and Interrupt?

A trap is an exception in a user process. It’s caused by division by zero or invalid memory access. It’s also the usual way to invoke a kernel routine (a system call) because those run with a higher priority than user code. Handling is synchronous (so the user code is suspended and continues afterwards). In … Read more

Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first comments or variable names, I match that. Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles or especially packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in … Read more

Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

No, there are some instructions that can only decode 1/clock. This effect is Intel-only, not AMD. Theory: the “steering” logic that sends chunks of machine code to decoders looks for patterns in the opcode byte(s) during pre-decode, and any pattern-match that might be a multi-uop instruction has to get sent to the complex decoder. To … Read more