Fastest way to do horizontal SSE vector sum (or other reduction)

In general, for any kind of vector horizontal reduction: extract or shuffle the high half to line up with the low half, then do a vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there's just a single element left (with garbage in the high elements of the rest of the vector). If you start with vectors wider than 128-bit, narrow in half until you get …
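As a rough illustration of the shuffle-then-vertical-op pattern described above, here is a minimal sketch for 128-bit vectors using only SSE1/SSE2 intrinsics, so it should build on any x86-64 compiler without extra flags. The function names (`hsum_ps_sse`, `hmax_ps_sse`, `hsum_epi32_sse2`) are made up for this example, and the particular shuffles are just one reasonable choice, not necessarily the fastest on every CPU.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 float intrinsics */

/* Horizontal sum of 4 floats (hypothetical helper name). */
static inline float hsum_ps_sse(__m128 v) {
    __m128 hi   = _mm_movehl_ps(v, v);        /* [v2, v3, v2, v3]: high half lined up with low */
    __m128 sum2 = _mm_add_ps(v, hi);          /* [v0+v2, v1+v3, garbage, garbage] */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1)); /* element 1 everywhere */
    __m128 sum1 = _mm_add_ss(sum2, odd);      /* element 0 = (v0+v2) + (v1+v3) */
    return _mm_cvtss_f32(sum1);               /* extract element 0, ignore the rest */
}

/* Same shuffles, different vertical op: horizontal max of 4 floats. */
static inline float hmax_ps_sse(__m128 v) {
    __m128 hi   = _mm_movehl_ps(v, v);
    __m128 max2 = _mm_max_ps(v, hi);          /* [max(v0,v2), max(v1,v3), ...] */
    __m128 odd  = _mm_shuffle_ps(max2, max2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 max1 = _mm_max_ss(max2, odd);
    return _mm_cvtss_f32(max1);
}

/* Horizontal sum of 4 x int32 with SSE2 integer shuffles. */
static inline int hsum_epi32_sse2(__m128i v) {
    __m128i hi64  = _mm_unpackhi_epi64(v, v);            /* [v2, v3, v2, v3] */
    __m128i sum64 = _mm_add_epi32(v, hi64);              /* [v0+v2, v1+v3, ...] */
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); /* swap within pairs */
    __m128i sum32 = _mm_add_epi32(sum64, hi32);          /* element 0 = total */
    return _mm_cvtsi128_si32(sum32);                     /* low 32 bits */
}
```

Each step halves the number of elements that still matter, so a 4-element vector needs two shuffle + op pairs. Swapping `_mm_add_ps` for `_mm_max_ps` (or min/and/or/xor/mul) gives the other reductions with the same shuffles. Note that for floats the additions happen in a different order than a left-to-right scalar loop, so the rounding can differ slightly.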