As mentioned in a comment, there’s a difference between a compiler barrier and a processor barrier. volatile
and memory
in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.
Processor barrier are special instructions that must be explicitly given, e.g. rdtscp, cpuid
, memory fence instructions (mfence, lfence,
…) etc.
As an aside, while using cpuid
as a barrier before rdtsc
is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid
instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it’s better to use one of the memory fence instructions.
The Linux kernel uses mfence;rdtsc
on AMD platforms and lfence;rdtsc
on Intel. If you don’t want to bother with distinguishing between these, mfence;rdtsc
works on both although it’s slightly slower as mfence
is a stronger barrier than lfence
.
Edit 2019-11-25: As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See this commit “x86: Remove X86_FEATURE_MFENCE_RDTSC”: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f