Are load ops deallocated from the RS when they dispatch, when they complete, or at some other time?

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights. On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, … Read more

How are barriers/fences and acquire/release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later question “C++ How is release-and-acquire achieved on x86 only using MOV?”), but I’ll give a summary here. Still, it’s a good question, and it’s useful to collect this all in one place. On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW … Read more
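
As a minimal illustration of that point (my own sketch, not code from the answer), here is the standard release/acquire message-passing pattern in C++. On x86, both the release store and the acquire load compile to plain MOV instructions, because every x86 store already has release semantics and every load already has acquire semantics; only seq_cst stores need something extra (typically XCHG, or MOV plus MFENCE).

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int>  payload{0};
std::atomic<bool> ready{false};

void producer() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);    // plain MOV on x86
}

int consumer() {
    while (!ready.load(std::memory_order_acquire))   // plain MOV on x86
        ;                                            // spin until the flag is published
    return payload.load(std::memory_order_relaxed);  // guaranteed to observe 42
}

int main() {
    std::thread t(producer);
    std::printf("%d\n", consumer());                 // prints 42
    t.join();
}
```

Compiling this with a recent GCC or Clang and inspecting the asm should show no fence instructions on x86; on a weaker architecture such as AArch64, the same source needs load-acquire/store-release instructions or barriers.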

How do modern x86 processors actually compute multiplications?

Mitch Alsup (who worked on the Motorola 88K, Ross SPARC, AMD x86, etc.) has stated on the comp.arch newsgroup: “All modern multiplier designers use the Dadda method for building the tree.” (Message-ID: <[email protected]>, 14 December 2018) and (with respect to the availability of recent references for what multiplication mechanisms are used by AMD/Intel/NVIDIA): Only in the … Read more
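
The quoted remark is about the hardware reduction tree. As a rough software model of the idea (my own illustration, not anything from the answer), the sketch below builds an 8×8 partial-product matrix and compresses it column by column with full adders (3:2 compressors) until at most two rows remain, then finishes with a single carry-propagate add. A real Dadda tree schedules those compressors so that each stage uses as few adders as possible; this greedy version only shows the carry-save idea that Wallace and Dadda trees share.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Carry-save reduction of an 8x8 partial-product matrix, in the spirit of
// Wallace/Dadda multiplier trees (simplified column compression, not the
// exact Dadda scheduling).
uint32_t tree_multiply(uint8_t a, uint8_t b) {
    constexpr int W = 8;
    // cols[k] holds the individual 1-bit partial products of weight 2^k.
    std::vector<std::vector<int>> cols(2 * W + 1);
    for (int i = 0; i < W; ++i)
        for (int j = 0; j < W; ++j)
            cols[i + j].push_back(((a >> i) & 1) & ((b >> j) & 1));

    // Compress with full adders (3:2 compressors) until every column holds
    // at most two bits; each carry moves one column to the left.
    bool again = true;
    while (again) {
        again = false;
        for (int k = 0; k < 2 * W; ++k) {
            while (cols[k].size() >= 3) {
                int x = cols[k].back(); cols[k].pop_back();
                int y = cols[k].back(); cols[k].pop_back();
                int z = cols[k].back(); cols[k].pop_back();
                cols[k].push_back(x ^ y ^ z);                        // sum stays here
                cols[k + 1].push_back((x & y) | (x & z) | (y & z));  // carry moves left
                again = true;
            }
        }
    }

    // Final carry-propagate add of the two remaining rows.
    uint32_t row0 = 0, row1 = 0;
    for (int k = 0; k <= 2 * W; ++k) {
        if (cols[k].size() > 0) row0 |= uint32_t(cols[k][0]) << k;
        if (cols[k].size() > 1) row1 |= uint32_t(cols[k][1]) << k;
    }
    return row0 + row1;
}

int main() {
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b)
            if (tree_multiply(uint8_t(a), uint8_t(b)) != a * b) {
                std::printf("mismatch at %u * %u\n", a, b);
                return 1;
            }
    std::puts("all 8-bit products match");
}
```

The exhaustive check in main confirms that the column compression preserves the product for all 8-bit operand pairs.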

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

TL;DR: In all three cases, a penalty of a few cycles is incurred when performing a load and a store at the same time. The load latency is on the critical path in all three cases, but the penalty differs from case to case. Case 3 is about a cycle higher than case 1 … Read more
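
The three cases in the question differ in exactly where the dependent store lands relative to the pointer load. As a rough, hypothetical illustration of the loop shape under discussion (not the asker's exact code or asm), a C++ version of such a loop might look like the sketch below: the loaded pointer feeds the next iteration, so the load latency is the critical path, and a dependent store goes into the cache line that was just read.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>  // __rdtsc (x86 only)

// One node per 64-byte cache line.
struct alignas(64) Node { Node* next; uint64_t pad[7]; };

// Pointer-chasing loop with a nearby dependent store: the load of p->next is
// on the critical path; the store writes into the line that was just visited.
static uint64_t chase(Node* head, long iters) {
    Node* p = head;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; ++i) {
        Node* q = p->next;                           // load on the critical path
        p->pad[0] = reinterpret_cast<uint64_t>(q);   // dependent store, same line
        p = q;
    }
    uint64_t t1 = __rdtsc();
    return (t1 - t0) / uint64_t(iters);              // rough cycles per iteration
}

int main() {
    std::vector<Node> ring(256);                     // 16 KiB, fits in L1d
    for (size_t i = 0; i < ring.size(); ++i)
        ring[i].next = &ring[(i + 1) % ring.size()]; // circular pointer chain
    std::printf("~%llu cycles per iteration\n",
                (unsigned long long)chase(ring.data(), 100000000L));
}
```

Built with optimizations on an x86 core where the ring stays in L1d, the cycles-per-iteration figure should be dominated by the L1d load-use latency plus whatever extra penalty the nearby store introduces, which is the effect the answer quantifies case by case.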