micro-optimization - w3toppers.com

DateTime.DayOfWeek micro optimization

Let’s do some tuning. The prime factorization of TimeSpan.TicksPerDay (864000000000) is 2^14 × 3^3 × 5^9, that is, 2^14 × 52734375. DayOfWeek can now be expressed as: public DayOfWeek DayOfWeek { get { return (DayOfWeek)(((Ticks>>14) / 52734375 + 1L) % 7L); } } And since we are working modulo 7, note that 52734375 % 7 is 1. So, the code above is equal to: public static DayOfWeek dayOfWeekTurbo(this …
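The arithmetic behind that rewrite can be checked directly. The sketch below is in Java rather than C#, purely for illustration. The class and method names are hypothetical, and it assumes `ticks` is a non-negative count of 100 ns intervals since 0001-01-01, as in .NET. It covers only the shift-and-divide form quoted in the excerpt, not the truncated `dayOfWeekTurbo` version.

```java
final class DayOfWeekCheck {
    // .NET's TimeSpan.TicksPerDay: number of 100 ns ticks in one day.
    static final long TICKS_PER_DAY = 864_000_000_000L;
    // TICKS_PER_DAY with its factor of 2^14 divided out (3^3 * 5^9).
    static final long ODD_FACTOR = 52_734_375L;

    // Baseline: whole days since 0001-01-01 (a Monday), numbered with Sunday = 0.
    static int dayOfWeekPlain(long ticks) {
        return (int) ((ticks / TICKS_PER_DAY + 1L) % 7L);
    }

    // Form from the excerpt: shift out the 2^14 first, then divide by the odd factor.
    static int dayOfWeekShift(long ticks) {
        return (int) (((ticks >> 14) / ODD_FACTOR + 1L) % 7L);
    }
}
```

For example, 2000-01-01 corresponds to 630822816000000000 ticks, and both methods return 6 (Saturday) for that value.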

Cycles/cost for L1 Cache hit vs. Register on x86?

Here’s a great article on the subject: http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1 To answer your question – yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly 😉 PS: The specifics will vary, but this link has some good ballpark figures: Approximate cost to access various caches …

Fastest way to strip all non-printable characters from a Java String

Using a single char array could work a bit better: int length = s.length(); char[] oldChars = new char[length]; s.getChars(0, length, oldChars, 0); int newLen = 0; for (int j = 0; j < length; j++) { char ch = oldChars[j]; if (ch >= ' ') { oldChars[newLen] = ch; newLen++; } } s = new …
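The excerpt is cut off before its final statement, so the following is only a minimal, self-contained sketch of the same single-array idea, not the answer's exact code. The class name, the method name, and the way the result string is built are assumptions. As in the excerpt, "non-printable" here means only characters below the space character, so a character such as DEL (U+007F) is kept.

```java
final class ControlCharStripper {
    // Removes every char below ' ' (U+0020), compacting in place within one array.
    static String stripControlChars(String s) {
        int length = s.length();
        char[] chars = new char[length];
        s.getChars(0, length, chars, 0);

        int newLen = 0;
        for (int j = 0; j < length; j++) {
            char ch = chars[j];
            if (ch >= ' ') {
                // Kept characters are written back over the already-scanned prefix.
                chars[newLen++] = ch;
            }
        }

        // Assumption: return the original string when nothing was removed.
        return newLen == length ? s : new String(chars, 0, newLen);
    }
}
```

Calling `ControlCharStripper.stripControlChars("a\tb\nc")` drops the tab and the newline and yields `"abc"`.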

Micro Optimization of a 4-bucket histogram of a large array or list

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32. The general …
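The excerpt stops before describing the general approach, so the sketch below is only a scalar illustration of one compare-and-count idea, written in Java. It is not the answer's AVX2 code and says nothing about the cycle counts quoted above. It assumes every element lies in the range 0 to 3, and the class and method names are hypothetical.

```java
final class FourBucketHistogram {
    // Returns counts[v] = number of elements equal to v; assumes all values are in 0..3.
    static int[] histogram(int[] data) {
        int c0 = 0, c1 = 0, c2 = 0;
        for (int v : data) {
            // Each comparison contributes 0 or 1 to a fixed counter,
            // so no counter is selected through a data-dependent index.
            c0 += (v == 0) ? 1 : 0;
            c1 += (v == 1) ? 1 : 0;
            c2 += (v == 2) ? 1 : 0;
        }
        // Under the 0..3 assumption, the last bucket is whatever remains.
        int c3 = data.length - c0 - c1 - c2;
        return new int[] {c0, c1, c2, c3};
    }
}
```

Comparing against each bucket value mirrors what a vector compare such as `_mm_cmpeq_epi32` does per lane, which is why this shape is often a starting point for a SIMD version.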

How to force NASM to encode [1 + rax*2] as disp32 + index*2 instead of disp8 + base + index?

NOSPLIT: Similarly, NASM will split [eax*2] into [eax+eax] because that allows the offset field to be absent and space to be saved; in fact, it will also split [eax*2+offset] into [eax+eax+offset]. You can combat this behaviour by the use of the NOSPLIT keyword: [nosplit eax*2] will force [eax*2+0] to be generated literally. [nosplit eax*1] also …

Modern x86 cost model

The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples. Another excellent reference is Agner Fog’s optimization resources, which have the virtue of also covering AMD cores. Note that specific cost models …

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

TL;DR: For these three cases, a penalty of a few cycles is incurred when performing a load and store at the same time. The load latency is on the critical path in all three cases, but the penalty differs from case to case. Case 3 is about a cycle higher than case 1 …

latency vs throughput in intel intrinsics

For a much more complete picture of CPU performance, see Agner Fog’s microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel’s optimization manual. See also How many CPU cycles are needed for each assembly instruction? and What considerations …

Latency bounds and throughput bounds for processors for operations that must occur in sequence

Terminology: you can say a loop is “bound on latency”, but when analyzing that bottleneck I wouldn’t say “the latency bound” or “bounds”. That sounds wrong to me. The thing you’re measuring (or calculating via static performance analysis) is the latency or length of the critical path, or the length of the loop-carried dependency chain. …
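To make "loop-carried dependency chain" concrete, here is a small Java sketch with hypothetical names. The first method has a single chain in which every addition needs the previous result. The second splits the work across four independent chains, which gives the hardware more additions to overlap. Whether that translates into a speedup depends on the JIT and the CPU, and no timings are claimed here. Because floating-point addition is not associative, the two methods can also differ in rounding for general inputs.

```java
final class SumChains {
    // One accumulator: each add depends on the one before it (a single loop-carried chain).
    static double sumSingleChain(double[] a) {
        double sum = 0.0;
        for (double x : a) {
            sum += x;
        }
        return sum;
    }

    // Four accumulators: four independent chains that do not wait on each other.
    static double sumFourChains(double[] a) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        // Leftover elements when the length is not a multiple of four.
        for (; i < a.length; i++) {
            s0 += a[i];
        }
        return (s0 + s1) + (s2 + s3);
    }
}
```

In the single-chain version, the critical path is roughly the element count multiplied by the latency of one add. In the four-chain version each chain is a quarter as long, so the loop is more likely to be limited by add throughput instead.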

What are the costs of failed store-to-load forwarding on x86?

This is not really a full answer, but it is still evidence that the penalty is visible. MSVC 2022 benchmark, compiled with /std:c++latest. #include <chrono> #include <iostream> struct alignas(16) S { char* a; int* b; }; extern "C" void init_fused_copy_unfused(int n, S & s2, S & s1); extern "C" void init_fused_copy_fused(int n, S & s2, S & …