AVX2: Computing dot product of 512 float arrays
_mm256_dp_ps is only useful for dot products of 2 to 4 elements; for longer vectors, use vertical SIMD in a loop and reduce to a scalar once at the end. Using _mm256_dp_ps and _mm256_add_ps inside the loop would be much slower. GCC and clang require you to enable (with command-line options) the ISA extensions that you use intrinsics …
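A minimal sketch of the "vertical SIMD in a loop, reduce at the end" approach might look like the following. The function name dot512, the fixed length of 512, the use of four accumulators, and the TARGET_AVX macro are illustrative assumptions, not taken from the original answer. The sketch uses separate multiply and add intrinsics so that it needs only AVX; with FMA enabled, _mm256_fmadd_ps could replace each multiply-add pair.

```c
#include <immintrin.h>

/* Assumption for this sketch: let GCC/clang enable AVX for this one function
   even if the file is built without -mavx. Passing -mavx2 (or -march=native)
   on the command line achieves the same thing for the whole file. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: dot product of two float arrays of exactly 512 elements.
   Unaligned loads are used, so the arrays need no special alignment. */
TARGET_AVX
float dot512(const float *a, const float *b)
{
    /* Four independent accumulators hide the latency of the add chain. */
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    /* Vertical SIMD: each iteration handles 4 vectors of 8 floats (32 floats). */
    for (int i = 0; i < 512; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                                 _mm256_loadu_ps(b + i)));
        acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(_mm256_loadu_ps(a + i + 8),
                                                 _mm256_loadu_ps(b + i + 8)));
        acc2 = _mm256_add_ps(acc2, _mm256_mul_ps(_mm256_loadu_ps(a + i + 16),
                                                 _mm256_loadu_ps(b + i + 16)));
        acc3 = _mm256_add_ps(acc3, _mm256_mul_ps(_mm256_loadu_ps(a + i + 24),
                                                 _mm256_loadu_ps(b + i + 24)));
    }

    /* Combine the four accumulators into one vector of 8 partial sums. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));

    /* Horizontal reduction, done once: 8 floats -> 4 -> 2 -> 1. */
    __m128 lo = _mm256_castps256_ps128(acc);        /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(acc, 1);      /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         /* s0+s2, s1+s3 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  /* add element 1 to element 0 */
    return _mm_cvtss_f32(s);
}
```

The only horizontal work happens after the loop, which is the point of the approach. With GCC or clang, a build line such as gcc -O2 -mavx2 file.c should work; the summation order differs from a scalar loop, so results can differ from it in the last bits.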