Simple for() loop benchmark takes the same time with any loop bound

BTW, if you'd actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. An integer literal too large for int automatically gets a wider type that can hold it, and i is converted to that type for the comparison, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value …
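A minimal sketch of why that comparison can never be false, assuming a platform with 32-bit int (as on x86 and x86-64). The helper name below_bound is hypothetical and used only for illustration; the checks run at compile time.

```cpp
#include <climits>

// Assumption: int is 32 bits wide, as on x86 and x86-64.
static_assert(sizeof(int) == 4, "this sketch assumes 32-bit int");

// The literal does not fit in int, so the compiler gives it a wider type.
static_assert(sizeof(49058349083) >= 8, "literal has a type of at least 64 bits");

// Hypothetical helper: the same comparison as the loop condition.
// i is converted to the literal's wider type before the comparison.
constexpr bool below_bound(int i) {
    return i < 49058349083;
}

// Even the largest int is below the bound, so a loop that only
// exits when this condition is false has no reachable exit.
static_assert(below_bound(INT_MAX), "condition holds for every int value");
```

Compilers may warn that the comparison is always true, which is a useful hint that the loop bound does not fit the counter's type.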

how do numactl & perf change memory placement policy of child processes?

TL;DR: The default policy used by numactl can cause performance issues, and so can the OpenMP thread binding. numactl constraints are applied to all (forked) child processes. Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind, or --localalloc. It changes the behavior of the operating system page allocation …

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

TL;DR: For these three cases, a penalty of a few cycles is incurred when performing a load and a store at the same time. The load latency is on the critical path in all three cases, but the penalty differs from case to case. Case 3 is about a cycle higher than case 1 …

Getting an accurate execution time in C++ (micro seconds)

If you are using C++11 or later, you could use std::chrono::high_resolution_clock. A simple use case: auto start = std::chrono::high_resolution_clock::now(); … auto elapsed = std::chrono::high_resolution_clock::now() - start; long long microseconds = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count(); This solution has the advantage of being portable. Beware that micro-benchmarking is hard. It's very easy to measure the wrong thing (like …
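The timing pattern from the excerpt can be wrapped in a small reusable function. This is only a sketch: the helper name elapsed_microseconds and its callable parameter are hypothetical, not part of the original answer.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed
// time in whole microseconds, using the same clock as the excerpt.
template <typename F>
long long elapsed_microseconds(F&& work) {
    auto start = std::chrono::high_resolution_clock::now();
    std::forward<F>(work)();  // the code being timed
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}
```

It could be called as long long us = elapsed_microseconds([] { /* code under test */ });. A single run like this is noisy, so repeating the measurement is usually worthwhile. std::chrono::steady_clock is an alternative when a clock that never goes backwards is required.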