Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?

The key to understanding the performance differences you’re seeing is in vectorization. Yes, the addition-based solution has a mere two instructions in its inner loop, but the important difference is not in how many instructions there are in the loop, but in how much work each instruction is performing.

In the first version, the output is purely dependent on the input: Each data[i] is a function just of i itself, which means that each data[i] can be computed in any order: The compiler can do them forwards, backwards, sideways, whatever, and you’ll still get the same result — unless you’re observing that memory from another thread, you’ll never notice which way the data is being crunched.
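
The question's first version isn't quoted above, but from the expanded form shown further down, it's presumably something like this (A, B, C, LEN, and data are all taken from that expansion):

for (int i = 0; i < LEN; i++) {
    data[i] = A*i*i + B*i + C;   /* depends only on i */
}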

In the second version, the output isn't dependent on i; it's dependent on the Y and Z from the last time around the loop.
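
Here's a sketch of what that second version presumably looks like; the initial values are my reconstruction from the math rather than a quote from the question (Z starts at f(0) = C, and Y at the first difference f(1) - f(0) = A + B):

double Z = C;        /* f(0) */
double Y = A + B;    /* first difference: f(1) - f(0) */
for (int i = 0; i < LEN; i++) {
    data[i] = Z;     /* uses the Z left over from the previous iteration */
    Z += Y;          /* Z becomes f(i+1) */
    Y += 2*A;        /* a quadratic's second difference is the constant 2*A */
}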

If we were to represent the bodies of these loops as little mathematical functions, they’d have very different overall forms:

  • f(i) -> data[i]
  • f(Y, Z) -> (data[i], Y', Z')

In the latter form, there’s no actual dependency on i — the only way you can compute the value of the function is by knowing the previous Y and Z from the last invocation of the function, which means that the functions form a chain — you can’t do the next one until you’ve done the previous one.

Why does that matter? Because the CPU has vector (SIMD) instructions, each of which can perform two, four, or even eight arithmetic operations at the same time: an SSE instruction handles two doubles at once, an AVX instruction four, and an AVX-512 instruction eight. That's four multiplies, four adds, four subtracts, four comparisons, four whatevers! So if the output you're trying to compute depends only on the input, then you can safely do two, four, or even eight at a time; it doesn't matter whether they're done forward or backward, since the result is the same. But if the output depends on previous computation, then you're stuck doing it in serial form, one at a time.
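
To make that concrete, here's a minimal illustrative sketch using AVX intrinsics (add4 is a made-up name, and this shows the idea in isolation, not what the compiler emitted for this code): one instruction performs four independent double-precision additions.

#include <immintrin.h>

void add4(double *dst, const double *a, const double *b) {
    __m256d va = _mm256_loadu_pd(a);               /* load a[0..3]      */
    __m256d vb = _mm256_loadu_pd(b);               /* load b[0..3]      */
    _mm256_storeu_pd(dst, _mm256_add_pd(va, vb));  /* four adds at once */
}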

That’s why the “longer” code wins for performance. Even though it has a lot more setup, and it’s actually doing a lot more work, most of that work is being done in parallel: It’s not computing just data[i] in each iteration of the loop — it’s computing data[i], data[i+1], data[i+2], and data[i+3] at the same time, and then jumping to the next set of four.

To expand a little on what I mean here, the compiler first turned the original code into something like this:

int i;
for (i = 0; i < LEN; i += 4) {
    data[i+0] = A*(i+0)*(i+0) + B*(i+0) + C;
    data[i+1] = A*(i+1)*(i+1) + B*(i+1) + C;
    data[i+2] = A*(i+2)*(i+2) + B*(i+2) + C;
    data[i+3] = A*(i+3)*(i+3) + B*(i+3) + C;
}

You can convince yourself that'll do the same thing as the original, if you squint at it. It did that because of all of those identical vertical columns of operators: all of those * and + operations are the same operation, just being performed on different data. And the CPU has special built-in instructions that can perform multiple * or multiple + operations on different data at the same time, often at a throughput of one such instruction per clock cycle.
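
If you were to write that packed form by hand, it would look roughly like this SSE2-intrinsics sketch. This is my illustration of the technique, not the compiler's actual output (fill_quadratic is a made-up name, and the leftover element for odd lengths would go to a scalar loop, omitted here):

#include <emmintrin.h>

void fill_quadratic(double *data, int len, double A, double B, double C) {
    __m128d vA = _mm_set1_pd(A);
    __m128d vB = _mm_set1_pd(B);
    __m128d vC = _mm_set1_pd(C);
    for (int i = 0; i + 1 < len; i += 2) {
        __m128d vi = _mm_set_pd(i + 1, i);               /* {i, i+1}           */
        __m128d t  = _mm_mul_pd(_mm_mul_pd(vA, vi), vi); /* A*i*i, twice over  */
        t = _mm_add_pd(t, _mm_mul_pd(vB, vi));           /* + B*i              */
        t = _mm_add_pd(t, vC);                           /* + C                */
        _mm_storeu_pd(&data[i], t);                      /* store both at once */
    }
}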

Notice the letter p in the instructions in the faster solution, addpd and mulpd, and the letter s in the instruction in the slower solution, addsd. That's "Add Packed Doubles" and "Multiply Packed Doubles," versus "Add Scalar Double" (s for scalar: one value at a time).
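
The difference is easy to see from intrinsics, too (one_add and two_adds are made-up names for illustration; with optimization on, these typically compile to a single addsd and addpd respectively):

#include <emmintrin.h>

__m128d one_add(__m128d a, __m128d b)  { return _mm_add_sd(a, b); } /* adds the low doubles only  */
__m128d two_adds(__m128d a, __m128d b) { return _mm_add_pd(a, b); } /* adds both pairs of doubles */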

Not only that, but it looks like the compiler partially unrolled the loop as well: each iteration doesn't handle just two values but four, with the operations interleaved to avoid dependencies and stalls, all of which also cuts down on the number of times the assembly code has to test i < 1000.

All of this only works, though, if there are no dependencies between iterations of the loop, i.e., if the only thing that determines what happens for each data[i] is i itself. If there are dependencies, if data from the last iteration influences the next one, then the compiler may be so constrained by them that it can't alter the code at all. Instead of being able to use fancy parallel instructions or clever optimizations (common subexpression elimination, strength reduction, loop unrolling, reordering, et al.), you get out code that's exactly what you put in: add Y, then add Z, then repeat.
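
For contrast, here's a classic shape of loop-carried dependency that tends to defeat auto-vectorization; this is a generic illustration, not the question's code. Each value of running needs the one from the previous iteration, so under strict floating-point rules the compiler has to keep the additions in order:

double out[LEN];
double running = 0.0;
for (int i = 0; i < LEN; i++) {
    running += data[i];   /* depends on the previous iteration's result */
    out[i]   = running;   /* a running (prefix) sum                     */
}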

But here, in the first version of the code, the compiler correctly recognized that there were no dependencies in the data, and figured out that it could do the work in parallel, and so it did, and that’s what makes all the difference.
