How are barriers/fences and acquire/release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later C++ Q&A “How is release-and-acquire achieved on x86 only using MOV?”), but I’ll give a summary here. Still, it’s a good question, and it’s useful to collect this all in one place. On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW …
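As a source-level illustration of the acquire/release pairing discussed above, here is a minimal C++ sketch of message passing between two threads. The names `payload`, `ready`, `producer`, `consumer` and `run_once` are hypothetical and not taken from the linked Q&As. On x86, mainstream compilers typically compile both the release store and the acquire load to a plain `mov`, which fits the statement that every asm load is already an acquire-load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
static int payload = 0;                   // ordinary, non-atomic data
static std::atomic<bool> ready{false};    // flag that publishes the data

void producer() {
    payload = 42;                                  // plain store
    // Release store: the write to payload above cannot be reordered after it.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: once it observes true, the read of payload below is
    // guaranteed to see the value written before the release store.
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    return payload;
}

int run_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    std::thread t(producer);   // thread creation orders the resets above before producer()
    int seen = consumer();
    t.join();
    assert(seen == 42);        // follows from the release/acquire pairing
    return seen;
}
```

Calling `run_once()` returns 42. Changing both orderings to `std::memory_order_relaxed` would remove that guarantee in the C++ memory model, even though x86 hardware would usually still behave the same way.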

How to access the control registers cr0, cr2, cr3 from a program? Getting segmentation fault

Quoting from the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 2A, page 3-650, on moving to and from control registers: “This instruction can be executed only when the current privilege level is 0.” This means the instruction can only be executed in kernel mode. A minimal kernel module that logs the contents of cr0, cr2 …

SSE multiplication of 4 32-bit integers

If you need signed 32×32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
        __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE …
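The following is a self-contained sketch of the even/odd-lane technique that the excerpt starts to show. It is written independently, so it is not necessarily identical to the Intel example. The function name `mul4_lo32` and the scalar fallback are assumptions added for illustration. It keeps the low 32 bits of each product, which are the same for signed and unsigned inputs. With SSE4.1 available, `_mm_mullo_epi32` does the same job in one instruction.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // SSE2 intrinsics
#define MUL4_HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = low 32 bits of a[i] * b[i], for i = 0..3.
void mul4_lo32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#ifdef MUL4_HAVE_SSE2
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // _mm_mul_epu32 multiplies elements 0 and 2, giving two 64-bit products.
    __m128i even = _mm_mul_epu32(va, vb);
    // Shift right by 4 bytes so elements 1 and 3 land in positions 0 and 2.
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(va, 4), _mm_srli_si128(vb, 4));
    // Move the low 32 bits of each 64-bit product into the two lowest lanes.
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    // Interleave: even0, odd0, even1, odd1 = products 0, 1, 2, 3.
    __m128i prod = _mm_unpacklo_epi32(even_lo, odd_lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), prod);
#else
    // Scalar fallback for targets without SSE2; unsigned math avoids signed overflow.
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(a[i]) *
                                           static_cast<std::uint32_t>(b[i]));
    }
#endif
}
```

Unaligned loads and stores are used so the caller does not need 16-byte-aligned arrays; aligned variants would work equally well when alignment is guaranteed.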

Branch target prediction in conjunction with branch prediction?

Do read along with the Intel optimization manual; the current download location is here. If that link has gone stale (they move things around all the time), search the Intel site for “Architectures optimization manual”. Keep in mind that the info there is fairly generic; they disclose only as much as needed to allow writing efficient code. Branch prediction implementation …

Convention for displaying vector registers

Being consistent is the most important thing; if I’m working on existing code that already has LSE-first (least-significant-element first) comments or variable names, I match that. Given the choice, I prefer MSE-first (most-significant-element first) notation in comments, particularly when designing something with shuffles, and especially when packing/unpacking to different element sizes. Intel uses MSE-first not only in their diagrams in …
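To make the two conventions concrete, here is a small sketch that prints the same four 32-bit elements in either order. The function `format_vec` and its parameters are hypothetical, chosen only for this illustration. MSE-first output matches the argument order of `_mm_set_epi32(e3, e2, e1, e0)`, and in that layout a left byte shift such as `_mm_slli_si128` moves data leftward on the page. LSE-first output matches memory order and `_mm_setr_epi32`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper. elems[0] is the least-significant element, i.e. the one
// at the lowest address when the vector is stored to memory.
std::string format_vec(const std::array<std::uint32_t, 4>& elems, bool mse_first) {
    std::string s = "[";
    for (std::size_t i = 0; i < elems.size(); ++i) {
        // MSE-first walks from element 3 down to 0; LSE-first from 0 up to 3.
        std::size_t idx = mse_first ? elems.size() - 1 - i : i;
        char buf[16];
        int n = std::snprintf(buf, sizeof buf, "%08x",
                              static_cast<unsigned>(elems[idx]));
        assert(n == 8);   // each element is rendered as 8 hex digits
        s += buf;
        if (i + 1 != elems.size()) s += ' ';
    }
    s += ']';
    return s;
}
```

For elements `{0, 1, 2, 3}`, the MSE-first form is `[00000003 00000002 00000001 00000000]` and the LSE-first form is `[00000000 00000001 00000002 00000003]`. Whichever form is used, labelling it in the comment avoids ambiguity for the next reader.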

Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

No, there are some instructions that can only decode at 1/clock. This effect is Intel-only, not AMD. Theory: the “steering” logic that sends chunks of machine code to decoders looks for patterns in the opcode byte(s) during pre-decode, and any pattern-match that might be a multi-uop instruction has to get sent to the complex decoder. To …