How are barriers/fences and acquire, release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later "C++ How is release-and-acquire achieved on x86 only using MOV?"), but I’ll give a summary here. Still, it’s a good question: it’s useful to collect this all in one place.

On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the “Memory Order Buffer”.
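To make the "every load is an acquire-load" point concrete, here is a minimal C++ sketch of the usual message-passing pattern. The names (payload, ready, message_passing_once) are made up for illustration. On x86, mainstream compilers typically emit a plain mov for both the release store and the acquire load, so the ordering comes from the hardware behaviour described above rather than from any barrier instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                  // plain (non-atomic) data
std::atomic<bool> ready{false};   // flag that publishes the payload

// Runs one producer/consumer pair and returns what the consumer read
// from payload after it observed ready == true.
int message_passing_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // x86: typically a plain mov store
    });
    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_acquire)) {  // x86: typically a plain mov load
            // spin until the flag is published
        }
        seen = payload;  // the acquire load synchronizes-with the release store
    });

    producer.join();
    consumer.join();
    assert(seen != -1);  // the consumer always writes seen before exiting
    return seen;
}
```

Calling message_passing_once() should always return 42 on any conforming implementation; what differs between ISAs is only which instructions the compiler needs to emit to get that guarantee.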

Weakly-ordered ISAs don’t have to speculate; they can just load in any order.

x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it’s physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).
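As a rough software analogy (a toy model, not a description of any real hardware), the allocate-in-order / execute-in-any-order / commit-in-order behaviour could be sketched like this. Every name here (StoreBuffer, allocate, execute_address, execute_data, retire, commit_oldest) is hypothetical, and the "L1d" is just a map from address to value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model only; all names are hypothetical.
struct StoreBufferEntry {
    std::optional<std::uint64_t> addr;  // filled in when the store-address uop executes
    std::optional<std::uint64_t> data;  // filled in when the store-data uop executes
    bool retired = false;               // set when the store instruction retires
};

template <std::size_t N>
class StoreBuffer {
    std::array<StoreBufferEntry, N> entries_{};
    std::size_t head_ = 0;   // oldest live entry: the next one allowed to commit
    std::size_t count_ = 0;  // number of live entries

public:
    // Allocate in program order at issue time. Returns the entry index,
    // or nullopt if the buffer is full (the front-end would stall).
    std::optional<std::size_t> allocate() {
        if (count_ == N) return std::nullopt;
        std::size_t idx = (head_ + count_) % N;
        entries_[idx] = StoreBufferEntry{};
        ++count_;
        return idx;
    }

    // Execution may happen in any order: it only fills an existing entry.
    void execute_address(std::size_t idx, std::uint64_t addr) { entries_[idx].addr = addr; }
    void execute_data(std::size_t idx, std::uint64_t data) { entries_[idx].data = data; }

    // The caller is assumed to retire stores in program order, after both uops executed.
    void retire(std::size_t idx) {
        assert(entries_[idx].addr && entries_[idx].data);
        entries_[idx].retired = true;
    }

    // Commit only the oldest entry, and only once it has retired.
    bool commit_oldest(std::unordered_map<std::uint64_t, std::uint64_t>& l1d) {
        if (count_ == 0) return false;
        StoreBufferEntry& e = entries_[head_];
        if (!e.retired) return false;
        l1d[*e.addr] = *e.data;
        head_ = (head_ + 1) % N;
        --count_;
        return true;
    }
};
```

In this model a younger store can have its address and data ready long before an older one, but it still can't reach the map until everything ahead of it has committed, which is the property that gives x86 its StoreStore ordering.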

Avoiding LoadStore is basically free: a load can’t retire until it’s executed (taken data from the cache). A store can’t commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is “graduated” and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads as well as tracking them in the ROB: let them retire once they’re known to be non-faulting, even if the data hasn’t arrived.

This seems more likely on an in-order core, but IDK. So you could have a load that’s retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That’s why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)

"How is load->store reordering possible with in-order commit?" goes into this more deeply, for in-order vs. out-of-order.

Barrier instructions

The only barrier instruction that does anything for regular stores is mfence, which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. "Are loads and stores the only instructions that gets reordered?" covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.
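From the software side, the case where draining the store buffer matters is the classic store-buffering (Dekker-style) litmus test: each thread stores to one variable and then loads the other. Below is a minimal C++ sketch using seq_cst atomics; the function name store_buffering_both_zero is made up for illustration. For a seq_cst store on x86, compilers typically emit either xchg or mov + mfence, and that full barrier is what rules out both threads reading 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `iterations`
// times and counts how often both threads read 0.
// With seq_cst operations the C++ memory model forbids that outcome.
int store_buffering_both_zero(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // x86: typically xchg, or mov + mfence
            r1 = y.load(std::memory_order_seq_cst);  // x86: typically a plain mov
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;  // would mean each load ran before the other thread's store was visible
    }
    return both_zero;
}
```

With seq_cst this should always return 0. If the stores were weakened to memory_order_release, each load could execute while the thread's own store is still sitting in the store buffer, so both threads reading 0 would become an allowed (if hard to observe with thread start-up overhead) outcome.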

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

C++ How is release-and-acquire achieved on x86 only using MOV?
How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?
How many memory barriers instructions does an x86 CPU have?
How can I experience “LFENCE or SFENCE can not pass earlier read/write”
Does lock xchg have the same behavior as mfence?
Does the Intel Memory Model make SFENCE and LFENCE redundant?
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths: goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
When should I use _mm_sfence _mm_lfence and _mm_mfence: high-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven’t used any NT stores just makes your code slower, for no benefit, compared to atomic_thread_fence(mo_release).
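To illustrate that last point, here is a minimal sketch of publishing data with standalone C++ fences instead of _mm_sfence(). The names (data_word, flag, publish, consume) are hypothetical. On x86, atomic_thread_fence with release or acquire ordering typically compiles to no instructions at all; it only stops the compiler from reordering across it, because the hardware already provides the ordering for regular (non-NT) stores and loads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int data_word = 0;         // plain (non-atomic) payload
std::atomic<int> flag{0};  // 0 = not published, 1 = published

void publish(int value) {
    data_word = value;
    // Release fence: earlier stores can't be reordered after the flag store.
    // On x86 this typically emits no instruction (unlike _mm_sfence()).
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

int consume() {
    while (flag.load(std::memory_order_relaxed) == 0) {
        // spin until the producer publishes
    }
    // Acquire fence: later loads can't be reordered before the flag load.
    // On x86 this also typically emits no instruction.
    std::atomic_thread_fence(std::memory_order_acquire);
    return data_word;
}
```

Running publish(7) in one thread and consume() in another should always give 7. The same source stays correct on weakly-ordered ISAs, where the compiler emits whatever barrier instructions that target actually needs.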
