Simple for() loop benchmark takes the same time with any loop bound

BTW, if you'd actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. An integer literal too large for int automatically gets a wider type that can hold it, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value …
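The point about the oversized bound can be checked with a small sketch. It assumes a platform where int is 32 bits; `loop_condition` and `check_condition_never_fails` are hypothetical names introduced here for illustration, not part of the original question.

```cpp
#include <cassert>
#include <climits>

// The unsuffixed literal does not fit in a 32-bit int, so the compiler gives
// it the first wider type that can hold it (long or long long).
// Assumption: int is 32 bits wide on the target platform.
static_assert(sizeof(49058349083) > sizeof(int),
              "assumes int is narrower than the literal's type");

// Hypothetical helper reproducing the loop condition from the excerpt.
// `i` is converted to the literal's wider type before the comparison.
constexpr bool loop_condition(int i) {
    return i < 49058349083;
}

// Checks the extremes of int: if even INT_MAX passes, no int value can
// ever make the condition false, so a loop using it has no exit.
inline void check_condition_never_fails() {
    assert(loop_condition(INT_MIN));
    assert(loop_condition(0));
    assert(loop_condition(INT_MAX));
}
```

Compilers may warn that the comparison is always true, which is exactly the effect described above.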

How to benchmark Boost Spirit Parser?

I have given things a quick scan. My profiler quickly told me that constructing the grammar and (especially) the lexer object consumed considerable resources. Indeed, just changing a single line in SpiritParser.cpp saved 40% of execution time (~28s down to ~17s): lexer::Lexer lexer; into static const lexer::Lexer lexer; Now, making the grammar static involves …
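The effect of that one-line change can be illustrated with a minimal sketch. `ExpensiveLexer`, `parse_fresh`, `parse_cached`, and `compare_construction_counts` are hypothetical stand-ins, not the Spirit lexer from the question. The sketch only counts constructor calls to show that a function-local static is built once and then reused.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical stand-in for an object that is costly to construct.
struct ExpensiveLexer {
    static int constructions;  // how many times the constructor has run
    ExpensiveLexer() { ++constructions; }

    // Counts whitespace-separated tokens; a placeholder for real lexing.
    std::size_t token_count(const std::string& input) const {
        std::istringstream stream(input);
        std::string token;
        std::size_t count = 0;
        while (stream >> token) ++count;
        return count;
    }
};
int ExpensiveLexer::constructions = 0;

// Builds a new lexer on every call.
std::size_t parse_fresh(const std::string& input) {
    ExpensiveLexer lexer;
    return lexer.token_count(input);
}

// Builds the lexer on the first call only and reuses it afterwards.
std::size_t parse_cached(const std::string& input) {
    static const ExpensiveLexer lexer;
    return lexer.token_count(input);
}

// Compares how many constructor calls five parses cost in each variant.
void compare_construction_counts() {
    int before = ExpensiveLexer::constructions;
    for (int i = 0; i < 5; ++i) parse_fresh("a b c");
    assert(ExpensiveLexer::constructions - before == 5);

    before = ExpensiveLexer::constructions;
    for (int i = 0; i < 5; ++i) parse_cached("a b c");
    assert(ExpensiveLexer::constructions - before <= 1);
}
```

The static variant only works when the object can be shared safely across calls, which is why the sketch marks it const.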

Benchmarking inside Java code

Yes, it is possible to implement performance benchmarks effectively in Java code. The important point is that any kind of performance benchmark adds its own overhead, so the question is how much of it you are willing to accept. System.currentMill..() is good enough for benchmarking in most cases, and nanoTime() is usually overkill. For …

How do numactl & perf change memory placement policy of child processes?

TL;DR: The default policy used by numactl can cause performance issues, as can the OpenMP thread binding. numactl constraints are applied to all (forked) child processes. Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind, or --localalloc. This policy changes the behavior of the operating system page allocation …

Benchmarking – How to count number of instructions sent to CPU to find consumed MIPS

perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran and how many core clock cycles it took. It also reports how much CPU time it used, and will calculate average instructions per core clock cycle for you, e.g. 3,496,129,612 instructions:u # 2.61 insn per cycle. It calculates …
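The arithmetic behind those figures is simple enough to sketch. The helpers below are hypothetical and not part of perf; they only show how an instruction count combines with a cycle count or a CPU time to give instructions per cycle and MIPS.

```cpp
#include <cassert>
#include <cstdint>

// Average instructions retired per core clock cycle.
inline double instructions_per_cycle(std::uint64_t instructions,
                                     std::uint64_t cycles) {
    assert(cycles > 0);
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Millions of instructions per second over a measured CPU time.
inline double mips(std::uint64_t instructions, double cpu_seconds) {
    assert(cpu_seconds > 0.0);
    return static_cast<double>(instructions) / cpu_seconds / 1e6;
}
```

For example, taking the instruction count from the excerpt and an assumed CPU time of 2.0 seconds (a made-up value, not from the original output), `mips(3496129612ULL, 2.0)` comes to about 1748.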

Timing CUDA operations

You could do something along the lines of:

    #include <stdio.h>
    #include <sys/time.h>

    struct timeval t1, t2;
    gettimeofday(&t1, 0);
    kernel_call<<<dimGrid, dimBlock, 0>>>();
    // Wait for the kernel to finish before reading the clock again.
    HANDLE_ERROR( cudaDeviceSynchronize() );
    gettimeofday(&t2, 0);
    // Elapsed wall-clock time in milliseconds.
    double time = (1000000.0*(t2.tv_sec-t1.tv_sec) + t2.tv_usec-t1.tv_usec)/1000.0;
    printf("Time to generate: %3.1f ms \n", time);

or:

    float time;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    kernel_call<<<dimGrid, dimBlock, 0>>>();
    …