Why do I see 400x outlier timings when calling clock_gettime repeatedly?

Why Outliers?

There are many software- and hardware-related reasons why you might see outlier events (and non-outlier variation) when you iterate 10 million times over two clock_gettime calls. These reasons include:

  • Context switches: the scheduler may decide to migrate your process between CPUs, and even if you pin your process to a CPU, the OS may periodically decide to run something else on your logical CPU.
  • SMT: assuming this is on a CPU with SMT (e.g., hyperthreading on x86), the scheduler will probably periodically schedule something on the sibling logical CPU (the one sharing a physical core with your process). This can dramatically affect the overall performance of your code, since the two threads are competing for the same core resources. Furthermore, there is probably a transition period between SMT and non-SMT execution during which nothing executes, since the core has to re-partition some resources when SMT execution begins.
  • Interrupts: A typical system will receive hundreds of interrupts per second at a minimum, from the network card, graphics devices, hardware clocks, system timers, audio devices, IO devices, cross-CPU IPIs, and so on. Try watch -n1 cat /proc/interrupts and see how much activity is occurring on what you might think is an otherwise idle system.
  • Hardware pauses: the CPU itself may periodically stop executing instructions for a variety of reasons such as power or thermal throttling, or just because the CPU is undergoing a frequency transition.
  • System Management Mode: totally apart from interrupts seen and handled by the OS, x86 CPUs have a type of “hidden interrupt” which allows SMM functionality to execute on your CPU, with the only apparent effect being periodic unexpected jumps in the cycle counters used to measure real time.
  • Normal performance variations: your code won’t execute in exactly the same way every time. Initial iterations will suffer data and instruction cache misses, and have untrained predictors for things like branch direction. Even in an apparent “steady state” you may still suffer performance variations from things beyond your control.
  • Different code paths: you might expect your loop to execute exactly the same instructions every time through1: after all, nothing is really changing, right? Well, if you dig into the internals of clock_gettime you may very well find branches that take a different path when some type of overflow occurs, or when a read of the adjustment factors in the VDSO races with an update, etc.

That’s not even a comprehensive list, but it should at least give you a taste of some of the factors that can cause outliers. You can eliminate or reduce the effect of some of these, but complete control is generally impossible on a modern non-realtime2 OS on x86.

My Guess

If I had to take a guess, based on a typical outlier of ~8000 ns (probably too small to be a context-switch interruption), you are likely seeing the effect of processor frequency scaling due to variable TurboBoost ratios. That’s a mouthful, but basically modern x86 chips run at different “max turbo” speeds depending on how many cores are active. My i7-6700HQ, for example, will run at 3.5 GHz if one core is active, but only 3.3, 3.2 or 3.1 GHz if 2, 3 or 4 cores are active, respectively.

This means that even if your process is never interrupted, any work at all which runs even briefly on another CPU may cause a frequency transition (e.g., because you transition from 1 to 2 active cores), and during such a transition the CPU is idled for thousands of cycles while voltages stabilize. You can find some detailed numbers and tests in this answer, but the upshot is that on the tested CPU the stabilization takes roughly 20,000 cycles, very much in line with your observed outliers of ~8000 nanoseconds. Sometimes you might get two transitions in a period, which doubles the impact, and so on.
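One rough way to check this hypothesis is to watch the frequency the kernel reports for your benchmark CPU and see whether changes line up with your outliers. The sketch below is only an assumption-laden illustration: it polls the cpufreq sysfs file for CPU 0, whose presence and exact behavior depend on your kernel and frequency driver.

    #include <stdio.h>
    #include <time.h>

    /* Poll the cpufreq sysfs file for CPU 0 and report changes (values are in kHz).
     * Runs until interrupted with Ctrl-C. */
    int main(void) {
        const char *path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
        long last = -1;

        for (;;) {
            FILE *f = fopen(path, "r");
            if (!f) { perror(path); return 1; }
            long khz = -1;
            if (fscanf(f, "%ld", &khz) != 1)
                khz = -1;
            fclose(f);

            if (khz != last) {
                printf("frequency now %ld kHz\n", khz);
                last = khz;
            }

            struct timespec ts = { 0, 10 * 1000 * 1000 };  /* poll roughly every 10 ms */
            nanosleep(&ts, NULL);
        }
    }

Keep in mind that the watcher is itself work running on some CPU, so it can provoke exactly the active-core transitions you are trying to observe; treat it as a coarse check rather than a precise measurement.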

Narrow It Down

Get a Distribution

If you still want to know the cause of your outliers, you can take the following steps and observe the effect on the outlier behavior.

First, you should collect more data. Rather than just recording the max over 10,000,000 iterations, you should collect a histogram with some reasonable bucket size (say 100 ns, or even better some type of geometric bucket sizing that gives higher resolution for shorter times). This will be a huge help because you’ll be able to see exactly where the times are clustering: it is entirely possible that you have effects other than the 6000 – 17000 ns outliers that you note with “max”, and they can have different causes.

A histogram also lets you understand the outlier frequency, which you can correlate with frequencies of things you can measure to see if they match up.
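As a rough illustration, here is a minimal sketch of such a histogram loop in C. The power-of-two bucketing, the CLOCK_MONOTONIC clock and the iteration count are just assumptions for the example; adapt them to your actual code.

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define ITERS   10000000
    #define BUCKETS 64              /* one bucket per power of two of nanoseconds */

    static inline uint64_t nsec(const struct timespec *ts) {
        return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
    }

    int main(void) {
        static uint64_t hist[BUCKETS];   /* hist[b] counts deltas in [2^b, 2^(b+1)) ns */
        struct timespec t0, t1;

        for (long i = 0; i < ITERS; i++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            clock_gettime(CLOCK_MONOTONIC, &t1);   /* the timed region: two back-to-back calls */

            uint64_t delta = nsec(&t1) - nsec(&t0);

            /* bucket index = floor(log2(delta)); recording happens outside the timed region */
            unsigned b = 0;
            while (delta >>= 1)
                b++;
            hist[b]++;
        }

        for (unsigned b = 0; b < BUCKETS; b++)
            if (hist[b])
                printf("[%llu, %llu) ns: %llu\n",
                       1ull << b, 2ull << b, (unsigned long long)hist[b]);
        return 0;
    }

Compile with something like cc -O2 histo.c (very old glibc versions need -lrt for clock_gettime), run it pinned to one CPU (e.g., with taskset), and look at which buckets the counts fall into.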

Now adding the histogram code also potentially adds more variance to the timing loop, since (for example) you’ll be accessing different cache lines depending on the timing value, but this is manageable, especially because the recording of the time happens outside the “timed region”.

Issue-Specific Mitigations

With that in hand, you can try to systematically check the issues I mentioned above to see if they are the cause. Here are some ideas:

  1. Hyperthreading: Just turn it off in the BIOS while running single-threaded benchmarks, which eliminates that whole class of issues in one move. In general, I’ve found that this also leads to a giant reduction in fine-grained benchmark variance, so it’s a good first step.

  2. Frequency scaling: On Linux, you can usually disable sub-nominal frequency scaling by setting the cpufreq governor to “performance”. You can disable super-nominal (aka turbo) frequencies by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1 if you’re using the intel_pstate driver. You can also manipulate the turbo mode directly via MSR if you have another driver, or you can do it in the BIOS if all else fails. In the linked question the outliers basically disappear when turbo is disabled, so that’s something to try first.

    Assuming you actually want to keep using turbo in production, you can limit the max turbo ratio manually to some value that applies to N cores (e.g., 2 cores), and then offline the other CPUs so at most that number of cores will ever be active. Then you’ll be able to run at your new max turbo all the time no matter how many cores are active (of course, you might still be subject to power, current or thermal limits in some cases).

  3. Interrupts: you can search for “interrupt affinity” to try to move interrupts to/from your pinned core and see the effect on the outlier distribution. You can also count the number of interrupts (e.g., via /proc/interrupts) and see whether the count is enough to explain the outlier count. If you find that timer interrupts specifically are the cause, you can explore the various “tickless” (aka “NOHZ”) modes your kernel offers to reduce or eliminate them. You can also count them directly via the HW_INTERRUPTS.RECEIVED performance counter on x86.

  4. Context switches: you can use realtime priorities or isolcpus to prevent other processes from running on your CPU (a minimal pinning and priority sketch follows this list). Keep in mind that context-switch issues, while usually positioned as the main/only issue, are actually fairly rare: at most they generally happen at the HZ rate (often 250/second on modern kernels), but it will be rare on a mostly idle system that the scheduler actually decides to schedule another process on your busy CPU. If you make your benchmark loops short, you can generally almost entirely avoid context switches.

  5. Code-related performance variations: you can check whether this is happening with various profiling tools like perf. You can carefully design the core of your packet-handling code to avoid outlier events like cache misses, e.g., by pre-touching cache lines, and you can avoid the use of system calls with unknown complexity as much as possible.
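For point 4, here is a minimal sketch of pinning the current thread to one CPU and giving it a realtime priority before entering the timing loop. It is Linux-specific, assumes you have root or CAP_SYS_NICE for the SCHED_FIFO part, and the CPU number and priority are arbitrary placeholders.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Pin the calling thread to `cpu` and switch it to SCHED_FIFO at `prio`.
     * Returns 0 on success; the realtime part needs CAP_SYS_NICE (or root). */
    static int pin_and_elevate(int cpu, int prio) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }

        struct sched_param sp;
        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = prio;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }

    int main(void) {
        /* e.g., pin to CPU 2 (ideally one listed in isolcpus=) at FIFO priority 10 */
        if (pin_and_elevate(2, 10) != 0)
            return 1;

        /* ... timing/histogram loop goes here ... */
        return 0;
    }

You can get the same effect from the shell with taskset and chrt. Be aware that a SCHED_FIFO thread that never blocks can starve everything else on that CPU, so keep the runs short.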

While some of the above are purely for investigative purposes, many of them will both help you determine what’s causing the pauses and also mitigate them.

I’m not aware of mitigations for all issues, however – for stuff like SMM you’d perhaps need specialized hardware or BIOS support to avoid it.


1 Well except perhaps in the case that the if( (mtime2-mtime)> m_TSmax ) condition is triggered – but this should be rare (and perhaps your compiler has made it branch-free, in which case there is only one execution path).

2 It’s not actually clear you can get to “zero variance” even with a hard realtime OS: some x86-specific factors like SMM mode and DVFS related stalls seem unavoidable.
