SSE reduction of float vector

Typically you generate 4 partial sums in your loop and then just sum horizontally across the 4 elements after the loop, e.g.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>

float vsum(const float *a, int n)
{
    float sum;
    __m128 vsum = _mm_set1_ps(0.0f);
    assert((n & 3) == 0);
    assert(((uintptr_t)a & 15) == 0);
    for (int i = …
```
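For reference, here is a minimal, self-contained sketch of the approach described above: keep four running partial sums in one `__m128` accumulator, then combine the four lanes once after the loop. It is not the original answer's code. The function name `vsum_sketch`, the variable names, and the particular `_mm_movehl_ps` / `_mm_shuffle_ps` sequence used for the horizontal step are assumptions chosen for illustration, and it keeps the same preconditions as the excerpt (`n` a multiple of 4, `a` 16-byte aligned).

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_movehl_ps, ...

// Hypothetical helper (name is illustrative): sums n floats starting at a.
// Assumes n is a multiple of 4 and a is 16-byte aligned.
float vsum_sketch(const float *a, int n)
{
    assert((n & 3) == 0);
    assert((reinterpret_cast<std::uintptr_t>(a) & 15) == 0);

    // Four partial sums, one per lane: lane k accumulates a[k], a[k+4], a[k+8], ...
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(&a[i]));

    // Horizontal sum of the four lanes [s0, s1, s2, s3].
    __m128 hi    = _mm_movehl_ps(acc, acc);   // [s2, s3, s2, s3]
    __m128 pairs = _mm_add_ps(acc, hi);       // [s0+s2, s1+s3, ...]
    __m128 odd   = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 1, 1, 1)); // lane 1 in every lane
    __m128 total = _mm_add_ss(pairs, odd);    // lane 0 = s0+s2+s1+s3

    return _mm_cvtss_f32(total);              // extract lane 0
}
```

Because the additions are grouped per lane rather than strictly left to right, the result can differ in the last bits from a plain scalar loop over the same data.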