benchmarking - w3toppers.com

Is A==0 really better than ~A?

This is not strictly an answer but rather my contribution to the discussion. I used the profiler to investigate a slightly modified version of your code:
    N_arr = 200:400:3800; %// for medium to large sized input array
    for k1 = 1:numel(N_arr)
        A = randi(1,N_arr(k1));
        [~]=eq(A,0);
        clear A
        A = randi(1,N_arr(k1));
        [~]=not(A);
        clear A
    end
I used … Read more

Simple for() loop benchmark takes the same time with any loop bound

BTW, if you’d actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. Large literal numbers are implicitly promoted to a type large enough to hold them, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value … Read more
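The point about literal promotion can be checked at compile time. The following is only a sketch, assuming a platform with 32-bit int; the helper name `loop_condition_holds` is hypothetical and does not come from the original answer.

```cpp
#include <climits>
#include <type_traits>

// The literal does not fit in a 32-bit int, so the compiler gives it a
// wider type (long or long long, depending on the platform).
static_assert(!std::is_same<decltype(49058349083), int>::value,
              "the literal is wider than int");
static_assert(sizeof(49058349083) >= 8, "the literal needs at least 64 bits");

// Hypothetical helper: evaluates the loop condition for an int counter.
// The int operand is converted to the literal's wider type before comparing.
constexpr bool loop_condition_holds(int i) {
    return i < 49058349083;
}

// Every int value satisfies the condition, so the loop cannot end by the
// condition becoming false.
static_assert(loop_condition_holds(INT_MAX), "true even for INT_MAX");
static_assert(loop_condition_holds(INT_MIN), "true for INT_MIN as well");
```

A compiler may warn that the comparison is always true, which is exactly the behaviour the excerpt describes.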

How to benchmark Boost Spirit Parser?

I have given things a quick scan. My profiler quickly told me that constructing the grammar and (especially) the lexer object took quite some resources. Indeed, just changing a single line in SpiritParser.cpp saved 40% of execution time (~28s down to ~17s): from lexer::Lexer lexer; into static const lexer::Lexer lexer; Now, making the grammar static involves … Read more

“Escape” and “Clobber” equivalent in MSVC

While I don’t know of an equivalent assembly trick for MSVC, Facebook uses the following in their Folly benchmark library:
    /**
     * Call doNotOptimizeAway(var) against variables that you use for
     * benchmarking but otherwise are useless. The compiler tends to do a
     * good job at eliminating unused variables, and this function fools
     * it … Read more
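The idea described in that comment can be illustrated with a small portable sketch. This is not Folly's implementation: it swaps in a plain volatile store, relying on the fact that accesses to volatile objects are observable behaviour the compiler must preserve. The names `keep_alive`, `sum_to` and `run_workload` are hypothetical, and the helper is only meant for scalar types.

```cpp
// Hypothetical helper, not Folly's code: copying the value into a volatile
// object is an observable side effect, so the value has to be computed.
template <typename T>
void keep_alive(const T& value) {
    volatile T sink = value;
    (void)sink;  // silence "unused variable" warnings
}

// Example workload whose result would otherwise go unused in a benchmark.
inline unsigned long sum_to(unsigned long n) {
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// Benchmark body: without keep_alive, an optimizer could drop the call to
// sum_to entirely when the result is never read.
inline unsigned long run_workload(unsigned long n) {
    const unsigned long result = sum_to(n);
    keep_alive(result);
    return result;
}
```

The compiler is still free to compute the sum more cleverly (for example with a closed-form expression); the sketch only ensures that the result is produced at all.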

Benchmarking inside Java code

Yes, it is possible to implement performance benchmarks effectively in Java code. The important point is that any kind of performance benchmark adds its own overhead, so the question is how much of it you are willing to accept. System.currentMill..() is a good enough benchmark for performance, and in most cases nanoTime() is overkill. For … Read more

How to speed up matrix multiplication in C++?

Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
    matrix mult_std(matrix a, matrix b) {
        matrix c(a.dim(), false, false);
        for (int i = 0; i < a.dim(); i++)
            for (int k = 0; k < a.dim(); k++)
                for (int j = 0; j … Read more
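Since the excerpt is cut off, here is a self-contained sketch of the same loop interchange. It assumes square matrices stored row-major in a flat `std::vector<double>`; the alias `Mat` and the functions `mult_ijk` and `mult_ikj` are hypothetical stand-ins for the `matrix` class in the question, whose definition is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // hypothetical row-major n*n storage

// Textbook i-j-k order: the inner loop reads b down a column (stride n).
Mat mult_ijk(const Mat& a, const Mat& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Mat c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// i-k-j order: the inner loop walks b and c along a row (stride 1),
// which is friendlier to the cache for row-major data.
Mat mult_ikj(const Mat& a, const Mat& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Mat c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];  // constant over the inner loop
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    return c;
}
```

Both functions add the terms for each output element in the same order of k, so they return identical results; only the memory access pattern differs. Whether the second version is actually faster depends on matrix size and hardware, so it should be measured.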

How to load data quickly into R?

How do numactl & perf change memory placement policy of child processes?

TL;DR: The default policy used by numactl can cause performance issues, as can the OpenMP thread binding. numactl constraints are applied to all (forked) child processes. Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind or --localalloc. This policy changes the behavior of the operating system page allocation … Read more

Benchmarking – How to count number of instructions sent to CPU to find consumed MIPS

perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran, how many core clock cycles it took, and how much CPU time it used. It will also calculate average instructions per core clock cycle for you, e.g.
    3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates … Read more

Timing CUDA operations

You could do something along the lines of:
    #include <sys/time.h>
    struct timeval t1, t2;
    gettimeofday(&t1, 0);
    kernel_call<<<dimGrid, dimBlock, 0>>>();
    HANDLE_ERROR(cudaThreadSynchronize());
    gettimeofday(&t2, 0);
    double time = (1000000.0*(t2.tv_sec-t1.tv_sec) + t2.tv_usec-t1.tv_usec)/1000.0;
    printf("Time to generate: %3.1f ms \n", time);
or:
    float time;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    kernel_call<<<dimGrid, dimBlock, 0>>>();
    … Read more