Are load ops deallocated from the RS when they dispatch, when they complete, or at some other time?

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights. On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, … Read more

How are barriers/fences and acquire/release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later question “C++ How is release-and-acquire achieved on x86 only using MOV?”), but I’ll give a summary here. Still, it’s a good question, and it’s useful to collect this all in one place. On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW … Read more
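
As a minimal illustration of that point (my own sketch, not code from the answer), here is the standard release/acquire message-passing pattern in C++. On x86, both the release store and the acquire load compile to plain MOV instructions, because every x86 store already has release semantics and every load already has acquire semantics; only seq_cst stores need something extra (typically XCHG, or MOV plus MFENCE).

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int>  payload{0};
std::atomic<bool> ready{false};

void producer() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);    // plain MOV on x86
}

int consumer() {
    while (!ready.load(std::memory_order_acquire))   // plain MOV on x86
        ;                                            // spin until the flag is published
    return payload.load(std::memory_order_relaxed);  // guaranteed to observe 42
}

int main() {
    std::thread t(producer);
    std::printf("%d\n", consumer());                 // prints 42
    t.join();
}
```

Compiling this with a recent GCC or Clang and inspecting the asm should show no fence instructions on x86; on a weaker architecture such as AArch64, the same source needs load-acquire/store-release instructions or barriers.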

How do modern x86 processors actually compute multiplications?

Mitch Alsup (who worked on the Motorola 88K, Ross SPARC, AMD x86, etc.) has stated on the comp.arch newsgroup: “All modern multiplier designers use the Dadda method for building the tree.” (Message-ID: <[email protected]>, 14 December 2018) and (with respect to the availability of recent references for what multiplication mechanisms are used by AMD/Intel/NVIDIA): Only in the … Read more
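
The quoted remark is about the hardware reduction tree. As a rough software model of the idea (my own illustration, not anything from the answer), the sketch below builds an 8×8 partial-product matrix and compresses it column by column with full adders (3:2 compressors) until at most two rows remain, then finishes with a single carry-propagate add. A real Dadda tree schedules those compressors so that each stage uses as few adders as possible; this greedy version only shows the carry-save idea that Wallace and Dadda trees share.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Carry-save reduction of an 8x8 partial-product matrix, in the spirit of
// Wallace/Dadda multiplier trees (simplified column compression, not the
// exact Dadda scheduling).
uint32_t tree_multiply(uint8_t a, uint8_t b) {
    constexpr int W = 8;
    // cols[k] holds the individual 1-bit partial products of weight 2^k.
    std::vector<std::vector<int>> cols(2 * W + 1);
    for (int i = 0; i < W; ++i)
        for (int j = 0; j < W; ++j)
            cols[i + j].push_back(((a >> i) & 1) & ((b >> j) & 1));

    // Compress with full adders (3:2 compressors) until every column holds
    // at most two bits; each carry moves one column to the left.
    bool again = true;
    while (again) {
        again = false;
        for (int k = 0; k < 2 * W; ++k) {
            while (cols[k].size() >= 3) {
                int x = cols[k].back(); cols[k].pop_back();
                int y = cols[k].back(); cols[k].pop_back();
                int z = cols[k].back(); cols[k].pop_back();
                cols[k].push_back(x ^ y ^ z);                        // sum stays here
                cols[k + 1].push_back((x & y) | (x & z) | (y & z));  // carry moves left
                again = true;
            }
        }
    }

    // Final carry-propagate add of the two remaining rows.
    uint32_t row0 = 0, row1 = 0;
    for (int k = 0; k <= 2 * W; ++k) {
        if (cols[k].size() > 0) row0 |= uint32_t(cols[k][0]) << k;
        if (cols[k].size() > 1) row1 |= uint32_t(cols[k][1]) << k;
    }
    return row0 + row1;
}

int main() {
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b)
            if (tree_multiply(uint8_t(a), uint8_t(b)) != a * b) {
                std::printf("mismatch at %u * %u\n", a, b);
                return 1;
            }
    std::puts("all 8-bit products match");
}
```

The exhaustive check in main confirms that the column compression preserves the product for all 8-bit operand pairs.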

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

TL;DR: In all three cases, a penalty of a few cycles is incurred when performing a load and a store at the same time. The load latency is on the critical path in all three cases, but the penalty differs from case to case. Case 3 is about a cycle higher than case 1 … Read more
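
The three cases in the question differ in exactly where the dependent store lands relative to the pointer load. As a rough, hypothetical illustration of the loop shape under discussion (not the asker's exact code or asm), a C++ version of such a loop might look like the sketch below: the loaded pointer feeds the next iteration, so the load latency is the critical path, and a dependent store goes into the cache line that was just read.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>  // __rdtsc (x86 only)

// One node per 64-byte cache line.
struct alignas(64) Node { Node* next; uint64_t pad[7]; };

// Pointer-chasing loop with a nearby dependent store: the load of p->next is
// on the critical path; the store writes into the line that was just visited.
static uint64_t chase(Node* head, long iters) {
    Node* p = head;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; ++i) {
        Node* q = p->next;                           // load on the critical path
        p->pad[0] = reinterpret_cast<uint64_t>(q);   // dependent store, same line
        p = q;
    }
    uint64_t t1 = __rdtsc();
    return (t1 - t0) / uint64_t(iters);              // rough cycles per iteration
}

int main() {
    std::vector<Node> ring(256);                     // 16 KiB, fits in L1d
    for (size_t i = 0; i < ring.size(); ++i)
        ring[i].next = &ring[(i + 1) % ring.size()]; // circular pointer chain
    std::printf("~%llu cycles per iteration\n",
                (unsigned long long)chase(ring.data(), 100000000L));
}
```

Built with optimizations on an x86 core where the ring stays in L1d, the cycles-per-iteration figure should be dominated by the L1d load-use latency plus whatever extra penalty the nearby store introduces, which is the effect the answer quantifies case by case.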