Produce loops without cmp instruction in GCC

How about this. Compiler is gcc 4.9.0 mingw x64:

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);

    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}

gcc -c -O3 -march=corei7 -mavx2 triad.c

0000000000000000 <triad>:
   0:   44 89 c8                mov    eax,r9d
   3:   f7 d8                   neg    eax
   5:   48 98                   cdqe
   7:   48 85 c0                test   rax,rax
   a:   79 31                   jns    3d <triad+0x3d>
   c:   c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
  14:   4d 63 c9                movsxd r9,r9d
  17:   49 c1 e1 02             shl    r9,0x2
  1b:   4c 01 ca                add    rdx,r9
  1e:   4c 01 c9                add    rcx,r9
  21:   4d 01 c8                add    r8,r9

  24:   c5 f4 59 04 82          vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
  29:   c5 fc 58 04 81          vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
  2e:   c4 c1 7c 29 04 80       vmovaps YMMWORD PTR [r8+rax*4],ymm0
  34:   48 83 c0 08             add    rax,0x8
  38:   78 ea                   js     24 <triad+0x24>

  3a:   c5 f8 77                vzeroupper
  3d:   c3                      ret

Like your hand written code, gcc is using 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5 instruction loop, but the C code is awkward and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.

Leave a Comment