Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

This kind of reordering test is called IRIW (Independent Readers, Independent Writers), where we’re checking if two readers can see the same pair of stores appear in different orders. Related, maybe a duplicate: Acquire/release semantics with 4 threads

The very weak C++11 memory model does not require that all threads agree on a global order for stores, as @MWid’s answer says.
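For concreteness, here is a minimal sketch of the IRIW test in C++. The names `iriw_readers_disagree` and `count_iriw_disagreements` are made up for this illustration, not taken from any library. A serious test harness would keep the four threads alive and synchronize their start so the accesses actually overlap. Starting fresh threads on every run, as this sketch does, makes the interesting outcome very unlikely even on hardware that allows it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One run of the IRIW litmus test: two independent writers, two readers
// that load the same two variables in opposite orders.
// Returns true if the readers disagree about which store happened first.
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
bool iriw_readers_disagree() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

    std::thread writer_x([&] { x.store(1, StoreOrder); });
    std::thread writer_y([&] { y.store(1, StoreOrder); });
    std::thread reader_xy([&] {
        r1 = x.load(LoadOrder);
        r2 = y.load(LoadOrder);
    });
    std::thread reader_yx([&] {
        r3 = y.load(LoadOrder);
        r4 = x.load(LoadOrder);
    });

    writer_x.join();
    writer_y.join();
    reader_xy.join();
    reader_yx.join();

    // reader_xy saw x's store but not y's, while
    // reader_yx saw y's store but not x's.
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}

// Repeat the test and count how often the readers disagreed.
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
int count_iriw_disagreements(int iterations) {
    assert(iterations >= 0);
    int disagreements = 0;
    for (int i = 0; i < iterations; ++i) {
        if (iriw_readers_disagree<StoreOrder, LoadOrder>()) {
            ++disagreements;
        }
    }
    return disagreements;
}
```

Calling `count_iriw_disagreements<std::memory_order_seq_cst, std::memory_order_seq_cst>(n)` must return 0 on any conforming implementation. With `std::memory_order_release` stores and `std::memory_order_acquire` loads, the standard allows a nonzero count, although you should not expect to see one on x86.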

This answer will explain one possible hardware mechanism that can lead to threads disagreeing about the global order of stores, which may be relevant when setting up tests for lockless code. It's also just interesting if you like CPU architecture¹.

See A Tutorial Introduction to the ARM and POWER Relaxed Memory Models for an abstract model of what those ISAs allow: neither ARM nor POWER guarantees a consistent global store order seen by all threads. Actually observing this is possible in practice on POWER chips, and maybe possible in theory on ARM, but perhaps not on any actual implementation.

(Other weakly-ordered ISAs like Alpha also allow this reordering, I think. ARM used to allow it on paper, but probably no real implementations did this reordering. ARMv8 even strengthened its on-paper model to disallow this even for future hardware.)

In computer science, the term for a machine where stores become visible to all other threads at the same time (and thus there is a single global order of stores) is “multiple-copy atomic” or “multi-copy atomic”. x86 and SPARC’s TSO memory models have that property, but ARM and POWER don’t require it.

Current SMP machines use MESI to maintain a single coherent cache domain so that all cores have the same view of memory. Stores become globally visible when they commit from the store buffer into L1d cache. At that point a load from any other core will see that store. There is a single order of all stores committing to cache, because MESI maintains a single coherency domain. With sufficient barriers to stop local reordering, sequential consistency can be recovered.
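The best-known local reordering is a store sitting in the store buffer while a later load from the same core already executes (StoreLoad reordering). Here is a minimal sketch of the classic store-buffer litmus test, with an optional full barrier between each store and the following load. The names `both_loads_saw_zero` and `count_store_buffer_reorderings` are hypothetical, and as with any quick litmus test, spawning new threads per run makes the reordering hard to observe even where it is allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Each thread stores to one variable and then loads the other.
// Without a barrier, both loads may see 0 because each store can still be
// waiting in its core's store buffer when the other core's load executes.
template <bool UseFence>
bool both_loads_saw_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        if (UseFence) {
            // Full barrier: orders the store before the later load.
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        if (UseFence) {
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
        r2 = x.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeat the test and count how often both loads saw the initial value.
template <bool UseFence>
int count_store_buffer_reorderings(int iterations) {
    assert(iterations >= 0);
    int count = 0;
    for (int i = 0; i < iterations; ++i) {
        if (both_loads_saw_zero<UseFence>()) {
            ++count;
        }
    }
    return count;
}
```

With `UseFence` set to `true`, the C++ memory model forbids both loads from seeing 0, so the count must be 0. With `false`, a nonzero count is allowed, even on x86.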

A store can become visible to some but not all other cores before it becomes globally visible.

POWER CPUs use Simultaneous MultiThreading (SMT) (the generic term for hyperthreading) to run multiple logical cores on one physical core. The memory-ordering rules we care about are for logical cores that threads run on, not physical cores.

We normally think of loads as taking their value from L1d, but that's not the case when a core reloads its own recent store and the data is forwarded directly from the store buffer (store-to-load forwarding, or SLF). It's even possible for a load to get a value that was never present in L1d and never will be, even on strongly-ordered x86, with partial SLF. (See my answer on Globally Invisible load instructions).

The store buffer tracks speculative stores before the store instruction has retired, but also buffers non-speculative stores after they retire from the out-of-order-execution part of the core (the ROB / ReOrder Buffer).

The logical cores on the same physical core share a store buffer. Speculative (not-yet-retired) stores must stay private to each logical core. (Otherwise that would couple their speculation together and require both to roll back if a mis-speculation were detected. That would defeat part of the purpose of SMT, which is to keep the core busy while one thread is stalled or recovering from a branch mispredict).

But we can let other logical cores snoop the store buffer for non-speculative stores that will definitely commit to L1d cache eventually. Until they do, threads on other physical cores can’t see them, but logical cores sharing the same physical core can.

(I’m not sure this is exactly the HW mechanism that allows this weirdness on POWER, but it’s plausible).

This mechanism makes stores visible to SMT sibling cores before they’re globally visible to all cores. But it’s still local within the core, so this reordering can be cheaply avoided with barriers that just affect the store buffer, without actually forcing any cache interactions between cores.

(The abstract memory model proposed in the ARM/POWER paper models this as each core having its own cached view of memory, with links between caches that let them sync. But in typical modern physical hardware, I think the only such mechanism is between SMT siblings, not between separate cores.)

Note that x86 can’t allow other logical cores to snoop the store buffer at all because that would violate x86’s TSO memory model (by allowing this weird reordering). As my answer on What will be used for data exchange between threads are executing on one Core with HT? explains, Intel CPUs with SMT (which Intel calls Hyperthreading) statically partition the store buffer between logical cores.

Footnote 1: An abstract model for C++, or for asm on a particular ISA, is all you really need to know to reason about memory ordering.

Understanding the hardware details isn’t necessary (and can lead you into a trap of thinking something’s impossible just because you can’t imagine a mechanism for it).
