Although there is already an accepted answer, a few things were missed that could be used to improve all the answers. They are taken from this Intel article, which is all about fast lock implementation:
- Spin on a volatile read, not an atomic instruction; this avoids unneeded bus locking, especially on highly contended locks.
- Use back-off for highly contended locks.
- Inline the lock, preferably with intrinsics for compilers where inline asm is detrimental (basically MSVC).
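As a rough illustration of the three points together, here is a minimal sketch of a test-and-test-and-set spinlock with exponential back-off. It is not taken from the Intel article: the names `SpinLock`, `cpu_relax` and `kMaxBackoff` are made up for this example, the back-off cap of 1024 is an arbitrary guess that would need tuning, and a relaxed `std::atomic` load stands in for the "volatile read" (on x86 it compiles to a plain load with no bus lock). The `_mm_pause` intrinsic is used instead of inline asm so that it also works on MSVC.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      // _mm_pause on MSVC
#define SPIN_HAS_PAUSE 1
#elif defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>   // _mm_pause on GCC/Clang
#define SPIN_HAS_PAUSE 1
#endif

// Hint to the CPU that we are in a spin-wait loop.
// Falls back to yielding the thread on non-x86 targets (an assumption,
// other architectures have their own hint instructions).
inline void cpu_relax() noexcept {
#ifdef SPIN_HAS_PAUSE
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

class SpinLock {
public:
    void lock() noexcept {
        unsigned backoff = 1;
        for (;;) {
            // The atomic read-modify-write is attempted on entry and then
            // only after the lock has been observed free.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Spin on a plain load, which stays in the local cache
            // instead of issuing a locked instruction on every iteration.
            while (locked_.load(std::memory_order_relaxed)) {
                // Exponential back-off: pause longer each time the lock
                // is still seen as held, up to kMaxBackoff pauses.
                for (unsigned i = 0; i < backoff; ++i)
                    cpu_relax();
                if (backoff < kMaxBackoff)
                    backoff *= 2;
            }
        }
    }

    bool try_lock() noexcept {
        // Cheap read first, atomic exchange only if the lock looks free.
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept {
        // Debug-only sanity check: unlocking a free lock is a bug.
        assert(locked_.load(std::memory_order_relaxed));
        locked_.store(false, std::memory_order_release);
    }

private:
    static constexpr unsigned kMaxBackoff = 1024;  // arbitrary cap
    std::atomic<bool> locked_{false};
};
```

Because the class provides `lock`, `try_lock` and `unlock`, it should work with `std::lock_guard<SpinLock>` and `std::unique_lock<SpinLock>`. Defining the member functions in the class body makes them implicitly `inline`, which covers the third point without any compiler-specific keywords.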