Dependent load reordering in CPUs

Short answer: In an out-of-order processor, the load-store queue tracks and enforces memory-ordering constraints. Processors such as the Alpha 21264 have the hardware necessary to prevent dependent-load reordering, but enforcing this dependency can add overhead to inter-processor communication.

Long answer: Background on dependence tracking. This is probably best explained using … Read more

how are barriers/fences and acquire, release semantics implemented microarchitecturally?

Much of this has been covered in other Q&As (especially the later question “C++ - How is release-and-acquire achieved on x86 only using MOV?”), but I’ll give a summary here. Still, it’s a good question, and it’s useful to collect this all in one place. On x86, every asm load is an acquire load. To implement that efficiently, modern x86 HW … Read more

Should thread-safe class have a memory barrier at the end of its constructor?

Lazy<T> is a very good choice for thread-safe initialization. I think it should be left to the consumer to provide that:

```csharp
var queue = new Lazy<ThreadSafeQueue<int>>(() => new ThreadSafeQueue<int>());
Parallel.For(0, 10000, i =>
{
    if (i % 2 == 0)
        queue.Value.Enqueue(i);
    else
    {
        int item = -1;
        if (queue.Value.TryDequeue(out item))
            Console.WriteLine(item);
    }
});
```

… Read more

C++ Memory Barriers for Atomics

Both MemoryBarrier (MSVC) and _mm_mfence (supported by several compilers) provide a hardware memory fence, which prevents the processor from moving reads and writes across the fence. The main difference is that MemoryBarrier has platform-specific implementations for x86, x64 and IA64, whereas _mm_mfence specifically uses the mfence SSE2 instruction, so it’s not always available. … Read more

Does std::mutex create a fence?

As I understand it, this is covered in §1.10 Multi-threaded executions and data races, paragraph 5: The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization … Read more

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

Basically it has no significant effect on inter-core latency, and it is definitely never worth using “blindly” without careful profiling, if you suspect there might be any contention from later loads missing in cache. It’s a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait … Read more