C++11: the difference between memory_order_relaxed and memory_order_consume

Question 1

No.
memory_order_relaxed imposes no memory order at all:

Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation.

memory_order_consume, on the other hand, imposes memory ordering on data-dependent reads (in the current thread):

A load operation with this memory order performs a consume operation on the affected memory location: no reads in the current thread dependent on the value currently loaded can be reordered before this load.
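
As a minimal sketch of what “dependent” means here (the atomic pointer and the producer that would fill it are hypothetical additions of mine, not part of the question):

#include <atomic>

std::atomic<int*> ptr{nullptr};   // hypothetical: some other thread publishes a pointer here

void reader()
{
    int* p = ptr.load(std::memory_order_consume);
    if (p) {
        int value = *p;   // *p depends on the loaded value p, so with consume
        (void)value;      // this read cannot be reordered before the load
    }
}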

Edit

In general memory_order_seq_cst is stronger than memory_order_acq_rel, which in turn is stronger than memory_order_relaxed.
This is like having an Elevator A that can lift 800 kg and an Elevator C that can lift only 100 kg.
Now, if you had the power to magically change Elevator A into Elevator C, what would happen if the former were filled with 10 average-weight people?
That would be bad.

To see what exactly could go wrong with the code, consider the example in your question:

Thread A                                   Thread B
Payload = 42;                              g = Guard.load(memory_order_consume);
Guard.store(1, memory_order_release);      if (g != 0)
                                               p = Payload;

This snippet is intended to be run in a loop; there is no synchronization, only ordering, between the two threads.
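
For completeness, a self-contained version of that example could look like the sketch below; the thread setup and the waiting loop are my additions, not part of the original question:

#include <atomic>
#include <thread>

int Payload = 0;
std::atomic<int> Guard{0};

void thread_a()
{
    Payload = 42;                                  // plain store
    Guard.store(1, std::memory_order_release);     // publish: earlier writes must not move past this
}

void thread_b()
{
    int g;
    do {
        g = Guard.load(std::memory_order_consume); // loop until the guard is observed as set
    } while (g == 0);
    int p = Payload;                               // the read the ordering is meant to protect
    (void)p;
}

int main()
{
    std::thread tb(thread_b);
    std::thread ta(thread_a);
    ta.join();
    tb.join();
}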

With memory_order_relaxed, and assuming that a natural word load/store is atomic, the code would be equivalent to

Thread A                                   Thread B
Payload = 42;                              g = Guard
Guard = 1                                  if (g != 0)
                                               p = Payload;

From the CPU’s point of view, Thread A performs two stores to two separate addresses, so if Guard is “closer” to the CPU (meaning its store will complete faster), then to another processor it looks as if Thread A were performing

Thread A
Guard = 1
Payload = 42

And this order of execution is possible

Thread A   Guard = 1
Thread B   g = Guard
Thread B   if (g != 0) p = Payload
Thread A   Payload = 42

And that’s bad, since Thread B reads a stale value of Payload.

It could seem, however, that the ordering in Thread B is useless, since the CPU won’t do a reorder like

Thread B
if (g != 0) p = Payload;
g = Guard

But it actually will.

From its perspective there are two unrelated loads; it is true that one is on a dependent data path, but the CPU can still do the load speculatively:

Thread B
hidden_tmp = Payload;
g = Guard
if (g != 0) p = hidden_tmp

That may generate the sequence

Thread B   hidden_tmp = Payload;
Thread A   Payload = 42;
Thread A   Guard = 1;
Thread B   g = Guard
Thread B   if (g != 0) p = hidden_tmp

Whoops.
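
To make the failure mode concrete, here is a hedged litmus-test sketch of the relaxed variant. Both variables are made atomic purely to keep the program free of data races; the stale outcome is still permitted, although whether you actually observe it depends on the compiler and the hardware:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> Payload{0};
std::atomic<int> Guard{0};

void writer()
{
    Payload.store(42, std::memory_order_relaxed);   // no ordering between these two stores
    Guard.store(1, std::memory_order_relaxed);
}

void reader()
{
    int g = Guard.load(std::memory_order_relaxed);
    if (g != 0) {
        int p = Payload.load(std::memory_order_relaxed);
        if (p != 42)
            std::printf("stale Payload observed: %d\n", p);   // allowed with relaxed
    }
}

int main()
{
    std::thread tb(reader);
    std::thread ta(writer);
    ta.join();
    tb.join();
}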

Question 2

In general that can never be done.
You can replace memory_order_acquire with memory_order_consume when you are going to generate an address dependency between the loaded value and the value(s) whose accesses need to be ordered.
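
A sketch of the kind of code where that replacement is legitimate (the names are mine): the consumer dereferences the very pointer it loaded, so there is an address dependency from the load to the subsequent read:

#include <atomic>

struct Node {
    int data;
};

std::atomic<Node*> head{nullptr};

void producer(Node* n)
{
    n->data = 42;
    head.store(n, std::memory_order_release);         // publish the node
}

int consumer()
{
    Node* n = head.load(std::memory_order_consume);   // could equally be memory_order_acquire
    if (n)
        return n->data;   // address-dependent on n: ordered after the load even with consume
    return -1;
}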


To understand how memory_order_consume differs from memory_order_relaxed we can take the ARM architecture as a reference.
The ARM architecture mandates only a weak memory ordering, meaning that in general the loads and stores of a program can be executed in any order.

str r0, [r2]    @ store r0 to the address in r2
str r0, [r3]    @ store r0 to the address in r3

In the snippet above, the store to [r3] can be observed, externally, before the store to [r2]¹.

However, the CPU doesn’t go as far as the Alpha CPU; it respects two kinds of dependencies: an address dependency, where a value loaded from memory is used to compute the address of another load/store, and a control dependency, where a value loaded from memory is used to compute the condition (the control flags) governing another load/store.

In the presence of such a dependency, the ordering of the two memory operations is guaranteed to be visible in program order:

If there is an address dependency then the two memory accesses are observed in program order.

So, while a memory_order_acquire would generate a memory barrier, with memory_order_consume you are telling the compiler that the way you’ll use the loaded value will generate an address dependency and so it can, if relevant to the architecture, exploit this fact and omit a memory barrier.
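
As a rough comparison (a sketch only; actual code generation varies by compiler and target), on a weakly ordered CPU an acquire load is typically accompanied by a barrier, whereas a consume load of a pointer that is then dereferenced could, in principle, rely on the hardware’s address-dependency ordering instead:

#include <atomic>

struct Node { int data; };
std::atomic<Node*> head{nullptr};

int with_acquire()
{
    Node* n = head.load(std::memory_order_acquire);   // conceptually: load plus memory barrier
    return n ? n->data : -1;
}

int with_consume()
{
    Node* n = head.load(std::memory_order_consume);   // conceptually: plain load, no barrier;
    return n ? n->data : -1;                          // the dependent read is ordered by the hardware
}

Note that mainstream compilers are currently known to implement memory_order_consume conservatively by treating it as memory_order_acquire, so in practice the barrier may be emitted anyway.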


¹ If r2 is the address of a synchronization object, that’s bad.
