Why are loops always compiled into “do…while” style (tail jump)?

Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)

Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea. Sometimes this requires “loop rotation” (peeling part of the first iteration so the actual loop body has the …
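As a rough illustration of the idea, here is a minimal C sketch of the transformation a compiler typically applies. The function names `sum_while` and `sum_rotated` are hypothetical, chosen only for this example. The first version tests the condition at the top, which in naive asm costs a conditional branch at the top plus an unconditional jump at the bottom on every iteration. The second checks the condition once before entering, then keeps a single conditional branch at the bottom.

```c
#include <stddef.h>

/* Source-level form: condition tested at the top of every iteration.
 * Translated literally, this needs a conditional branch at the top
 * and an unconditional jump back up at the bottom. */
long sum_while(const int *a, size_t n) {
    long total = 0;
    size_t i = 0;
    while (i < n) {
        total += a[i];
        i++;
    }
    return total;
}

/* Hand-rotated form: one up-front check skips the loop when it would
 * run zero times, then the loop body ends in a single conditional
 * branch (the "do...while" / tail-jump shape). */
long sum_rotated(const int *a, size_t n) {
    long total = 0;
    size_t i = 0;
    if (i < n) {              /* guard: runs once, outside the loop */
        do {
            total += a[i];
            i++;
        } while (i < n);      /* only branch inside the loop */
    }
    return total;
}
```

Both functions return the same result for any `n`, including zero. Optimizing compilers usually do this rewrite themselves, so writing the rotated form by hand in C is rarely needed; the sketch only shows what the generated code tends to look like.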

Fastest way to do horizontal SSE vector sum (or other reduction)

In general for any kind of vector horizontal reduction, extract / shuffle high half to line up with low, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there’s just a single element (with high garbage in the rest of the vector). If you start with vectors wider than 128-bit, narrow in half until you get …
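The pattern described above can be sketched for a single 128-bit vector of four floats using SSE intrinsics in C. This is one simple way to follow the "line up high half with low, add, repeat" recipe, not necessarily the fastest shuffle choice on every CPU. The function name `hsum_ps` is hypothetical, and the code assumes an x86 target with SSE available.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Horizontal sum of the four floats in v (hypothetical helper name). */
float hsum_ps(__m128 v) {
    /* Step 1: bring the high half {v2, v3} down to line up with {v0, v1}. */
    __m128 hi   = _mm_movehl_ps(v, v);        /* {v2, v3, v2, v3} */
    __m128 sum2 = _mm_add_ps(v, hi);          /* {v0+v2, v1+v3, garbage, garbage} */

    /* Step 2: repeat on the remaining pair: bring element 1 down to element 0. */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum1 = _mm_add_ss(sum2, odd);      /* low element = v0+v1+v2+v3 */

    /* Only the low element is meaningful; the upper elements are leftovers. */
    return _mm_cvtss_f32(sum1);
}
```

For example, calling `hsum_ps(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f))` adds the pairs (1+3) and (2+4), then adds those two partial sums. The same shape works for other vertical operations: swap `_mm_add_ps` / `_mm_add_ss` for the min or max equivalents to get a horizontal min or max.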