GCC memory barrier __sync_synchronize vs asm volatile("" : : : "memory")

There's a significant difference between the two. The inline asm option does nothing at runtime: no instruction is emitted, and the CPU never sees it. It acts only at compile time, telling the compiler not to move loads or stores past this point (in either direction) as part of its optimizations. This is called a SW (software) barrier.

The __sync_synchronize builtin, on the other hand, translates into a HW (hardware) barrier: a fence instruction (typically mfence) if you're on x86, or its equivalent on other architectures. The CPU also performs various optimizations at runtime, the most important being executing memory operations out of order. This instruction tells it to make sure that loads and stores can't pass this point and are observed on the correct side of the sync point.
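
To make the difference concrete, here's a minimal sketch of both barriers side by side (the comments describe typical gcc codegen on x86; exact output varies by version and flags):

void sw_barrier(void)
{
    asm volatile("" : : : "memory");  /* no instruction emitted; constrains the compiler only */
}

void hw_barrier(void)
{
    __sync_synchronize();             /* emits a real fence, e.g. mfence on x86 */
}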

Here’s another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barrier that is specific to multi-processor environments. The names of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers, and on uni-processor systems they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier also acts as an implied software barrier.
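
For reference, the barrier() macro mentioned in that quote is essentially the same inline asm construct discussed above. In the Linux kernel headers it has historically been defined (for gcc builds) along these lines:

#define barrier() __asm__ __volatile__("" : : : "memory")

That is, a pure software barrier: it constrains the compiler and emits no instruction.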

As an example of when a SW barrier is useful, consider the following code:

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized.
Here's the assembly gcc 4.8.0 generated at -O3; note the packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>

However, when the inline asm barrier is added on each iteration, gcc is not permitted to move memory operations past it, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>
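
For completeness, the loop that produces this scalar code is the original one with the compiler barrier added to its body:

for (i = 0; i < N; ++i) {
    a[i]++;
    asm volatile("" : : : "memory");  /* gcc may not move loads/stores across this */
}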

However, when the CPU executes this code, it's still permitted to reorder the operations "under the hood", as long as it does not break the memory ordering model. This means the operations can be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.
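
To illustrate where the HW fence matters, here's a classic message-passing sketch (the names are mine, and in modern C you would express this with C11 atomics; the point is only to show which reorderings the fences rule out):

#include <stdio.h>

int data;
volatile int ready;                 /* volatile so the spin loop re-reads the flag */

void producer(void)                 /* runs on one CPU */
{
    data = 42;
    __sync_synchronize();           /* HW barrier: make data visible before ready */
    ready = 1;
}

void consumer(void)                 /* runs on another CPU */
{
    while (!ready)
        ;                           /* spin until the producer sets the flag */
    __sync_synchronize();           /* HW barrier: don't load data before seeing ready */
    printf("%d\n", data);           /* with the fences, this prints 42 */
}

On a strongly ordered architecture like x86 the two stores would not be reordered anyway, but on weakly ordered CPUs (ARM, POWER) the fences are what guarantee the consumer never observes ready == 1 with a stale data.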
