Dependent load reordering in CPUs

Short answer: In an out-of-order processor, the load-store queue tracks and enforces memory-ordering constraints. Processors such as the Alpha 21264 have the hardware necessary to prevent dependent-load reordering, but enforcing this dependency can add overhead to inter-processor communication.

Long answer: Background on dependence tracking. This is probably best explained using … Read more

how are barriers/fences and acquire, release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later question “C++ - How is release-and-acquire achieved on x86 only using MOV?”), but I’ll give a summary here. Still, it’s a good question, and it’s useful to collect this all in one place. On x86, every asm load is an acquire load. To implement that efficiently, modern x86 HW … Read more

Should thread-safe class have a memory barrier at the end of its constructor?

Lazy<T> is a very good choice for thread-safe initialization. I think it should be left to the consumer to provide that:

```csharp
var queue = new Lazy<ThreadSafeQueue<int>>(() => new ThreadSafeQueue<int>());
Parallel.For(0, 10000, i =>
{
    if (i % 2 == 0)
        queue.Value.Enqueue(i);
    else
    {
        int item = -1;
        if (queue.Value.TryDequeue(out item))
            Console.WriteLine(item);
    }
});
```

… Read more

C++ Memory Barriers for Atomics

Both MemoryBarrier (MSVC) and _mm_mfence (supported by several compilers) provide a hardware memory fence, which prevents the processor from moving reads and writes across the fence. The main difference is that MemoryBarrier has platform-specific implementations for x86, x64 and IA64, whereas _mm_mfence specifically uses the mfence SSE2 instruction, so it’s not always available. … Read more

Does std::mutex create a fence?

As I understand it, this is covered in §1.10 Multi-threaded executions and data races, paragraph 5: The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization … Read more

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

Basically it has no significant effect on inter-core latency, and it is definitely never worth using “blindly” without careful profiling, if you suspect there might be any contention from later loads missing in cache. It’s a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait … Read more