How about this? The compiler is GCC 4.9.0 (MinGW, x64):
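#include <immintrin.h>  /* AVX intrinsics */
#include <stdint.h>     /* intptr_t */

/* Assumes x, y and z are 32-byte aligned and n is a multiple of 8. */
void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);
    /* Count i from -n up to 0 so the loop test can reuse the flags set by the add. */
    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}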
gcc -c -O3 -march=corei7 -mavx2 triad.c
0000000000000000 <triad>:
   0:  44 89 c8                 mov        eax,r9d
   3:  f7 d8                    neg        eax
   5:  48 98                    cdqe
   7:  48 85 c0                 test       rax,rax
   a:  79 31                    jns        3d <triad+0x3d>
   c:  c5 fc 28 0d 00 00 00 00  vmovaps    ymm1,YMMWORD PTR [rip+0x0]
  14:  4d 63 c9                 movsxd     r9,r9d
  17:  49 c1 e1 02              shl        r9,0x2
  1b:  4c 01 ca                 add        rdx,r9
  1e:  4c 01 c9                 add        rcx,r9
  21:  4d 01 c8                 add        r8,r9
  24:  c5 f4 59 04 82           vmulps     ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
  29:  c5 fc 58 04 81           vaddps     ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
  2e:  c4 c1 7c 29 04 80        vmovaps    YMMWORD PTR [r8+rax*4],ymm0
  34:  48 83 c0 08              add        rax,0x8
  38:  78 ea                    js         24 <triad+0x24>
  3a:  c5 f8 77                 vzeroupper
  3d:  c3                       ret
Like your hand-written code, GCC uses 5 instructions for the loop. The GCC code uses scale=4 in its addressing modes where yours uses scale=1. I was able to get GCC to use scale=1 with a 5-instruction loop, but the C code is awkward, and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
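To make the scale=4 versus scale=1 distinction concrete, here is a scalar sketch of the two indexing styles. It is not the code I actually compiled: the function names (`triad_ref`, `triad_neg_index`, `triad_byte_offset`) are made up for illustration, the loops handle one float at a time rather than eight, and whether a given compiler really emits the addressing modes named in the comments is an assumption you would need to confirm with objdump.

```c
#include <assert.h>
#include <stdint.h>

/* Plain reference: index runs from 0 up to n. */
void triad_ref(const float *x, const float *y, float *z, int n) {
    const float k = 3.14159f;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        z[i] = x[i] + k * y[i];
}

/* Element index runs from -n up to 0, relative to the end of each array.
   The index counts floats, so the hardware address is base + i*4
   (the scale=4 form). */
void triad_neg_index(const float *x, const float *y, float *z, int n) {
    const float k = 3.14159f;
    assert(n >= 0);
    const float *xe = x + n;   /* one-past-the-end pointers */
    const float *ye = y + n;
    float *ze = z + n;
    for (intptr_t i = -(intptr_t)n; i < 0; i++)
        ze[i] = xe[i] + k * ye[i];
}

/* Byte offset runs from -4*n up to 0. The offset is already in bytes,
   so the address is base + i (the scale=1 form). The casts are what
   make this version awkward to read. */
void triad_byte_offset(const float *x, const float *y, float *z, int n) {
    const float k = 3.14159f;
    assert(n >= 0);
    const char *xe = (const char *)(x + n);
    const char *ye = (const char *)(y + n);
    char *ze = (char *)(z + n);
    const intptr_t step = (intptr_t)sizeof(float);
    for (intptr_t i = -(intptr_t)n * step; i < 0; i += step)
        *(float *)(ze + i) = *(const float *)(xe + i) + k * *(const float *)(ye + i);
}
```

All three functions compute the same result. In the two negative-index versions the counter reaches zero exactly when the loop is done, so the increment itself can serve as the loop test.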