Large Performance difference when summing ints vs floats in Cython vs NumPy

I’m not going to answer all of your questions, but only the (in my eyes) most interesting ones.

Let’s start with your counting example:

  1. the compiler is able to optimize the for-loop away in the integer case – the resulting binary doesn’t calculate anything, it only has to return the value precalculated during the compilation phase.
  2. This is not the case for the double version: because of possible rounding errors the result cannot simply be assumed to be 1.0*10**6, and cython compiles in IEEE 754-conforming mode (not -ffast-math) by default.

You must keep this in mind when looking at your cython code: the compiler is not allowed to rearrange the summations (IEEE 754), and because the result of each summation is needed for the next one, there is only one long serial dependency chain in which every operation has to wait for the previous one.
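That floating-point addition is not associative can be seen directly in plain Python (the same IEEE 754 doubles):

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6

So without -ffast-math the compiler may neither precompute nor reorder such a sum.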

But the most crucial insight is that numpy does not do the same thing as your cython code:

>>> sum_float(a_float)-a_float.sum()
2.9103830456733704e-08

Yep, nobody told numpy (unlike your cython code) that the sum has to be calculated like this:

((((a1+a2)+a3)+a4)+...

And numpy takes advantage of this in two ways:

  1. it performs pairwise summation (kind of), which leads to a smaller rounding error (a sketch follows after this list).

  2. it calculates the sum in chunks (numpy’s code is somewhat hard to understand because it is generated from a template; the listing of the actually used function pairwise_sum_DOUBLE is further down).
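To see point 1 in isolation, here is a small pure-Python sketch of pairwise summation (the function names and the blocksize are mine, not numpy’s; numpy’s real implementation is the C listing at the end):

import math

def naive_sum(a):
    # same order as the cython loop: ((((a[0]+a[1])+a[2])+a[3])+...
    total = 0.0
    for x in a:
        total += x
    return total

def pairwise_sum(a, blocksize=128):
    # recursive halving; small blocks are summed naively
    n = len(a)
    if n <= blocksize:
        return naive_sum(a)
    m = n // 2
    return pairwise_sum(a[:m], blocksize) + pairwise_sum(a[m:], blocksize)

a = [0.1] * 10**6
exact = math.fsum(a)             # correctly rounded reference
print(naive_sum(a) - exact)      # noticeably off
print(pairwise_sum(a) - exact)   # several orders of magnitude closer

The rounding error of pairwise summation grows like O(lg n) instead of O(n), as the comment in the listing below also states.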

The second point is the reason for the speed-up you are observing; the calculation happens similarly to the following scheme (at least as far as I understand the source code below):

a1 + a9  + ..... = r1
a2 + a10 + ..... = r2
..
a8 + a16 + ..... = r8

----> sum = r1 + .... + r8

The advantage of this kind of summation: the result of a2+a10 doesn’t depend on a1+a9, so both values can be calculated simultaneously (e.g. via pipelining) on modern CPUs, which leads to the speed-up you are observing.
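In pure Python the eight accumulators buy nothing (the interpreter dominates), but the following sketch shows the summation order that creates this independence (the function name is mine):

def sum_8_accumulators(a):
    # eight independent partial sums, as in the scheme above
    r = [0.0] * 8
    n8 = len(a) - len(a) % 8
    for i in range(0, n8, 8):
        for k in range(8):
            r[k] += a[i + k]
    # combine pairwise, exactly as pairwise_sum_DOUBLE does below
    res = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]))
    for x in a[n8:]:   # handle the non-multiple-of-8 rest
        res += x
    return res

In the C version each of the eight partial sums can live in its own register (or SIMD lane), so the additions per iteration can overlap.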


For what it is worth, on my machine the cython-integer-sum is slower than numpy’s.

The need to take the stride of the numpy array into account (it is only known at run time; see also this question about vectorization) prevents some optimizations. A workaround is to use memory views, through which you can make it clear that the data is contiguous, i.e.:

def sum_int_cont(np.int64_t[::1] a):
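For completeness, the full function might look like this (a sketch; I am assuming the loop body from the question’s sum_int):

%%cython
cimport numpy as np

def sum_int_cont(np.int64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0

    for i in range(n):
        total += a[i]
    return total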

On my machine this leads to a significant speed-up (a factor of 2):

%timeit sum_int(a_int)
2.64 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum_int_cont(a_int)
1.31 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit a_int.sum()
2.1 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It’s true that in this case using memory views for the doubles doesn’t bring any speed-up (I don’t know why), but in general it simplifies the life of the optimizer. For example, combining the memory-view variant with the -ffast-math compile option, which allows the compiler to treat the addition as associative, leads to performance comparable with numpy’s:

%%cython -c=-ffast-math
cimport numpy as np
def sum_float_cont(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0

    for i in range(n):
        total += a[i]
    return total

And now:

>>> %timeit sum_float(a_float)
3.46 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit sum_float_cont(a_float)
1.87 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_float.sum()
1.41 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Listing of pairwise_sum_DOUBLE:

/*
 * Pairwise summation, rounding error O(lg n) instead of O(n).
 * The recursion depth is O(lg n) as well.
 * when updating also update similar complex floats summation
 */
static npy_double
pairwise_sum_DOUBLE(npy_double *a, npy_uintp n, npy_intp stride)
{
    if (n < 8) {
        npy_intp i;
        npy_double res = 0.;
        for (i = 0; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else if (n <= PW_BLOCKSIZE) {
        npy_intp i;
        npy_double r[8], res;

        /*
         * sum a block with 8 accumulators
         * 8 times unroll reduces blocksize to 16 and allows vectorization with
         * avx without changing summation ordering
         */
        r[0] = (a[0 * stride]);
        r[1] = (a[1 * stride]);
        r[2] = (a[2 * stride]);
        r[3] = (a[3 * stride]);
        r[4] = (a[4 * stride]);
        r[5] = (a[5 * stride]);
        r[6] = (a[6 * stride]);
        r[7] = (a[7 * stride]);

        for (i = 8; i < n - (n % 8); i += 8) {
            r[0] += (a[(i + 0) * stride]);
            r[1] += (a[(i + 1) * stride]);
            r[2] += (a[(i + 2) * stride]);
            r[3] += (a[(i + 3) * stride]);
            r[4] += (a[(i + 4) * stride]);
            r[5] += (a[(i + 5) * stride]);
            r[6] += (a[(i + 6) * stride]);
            r[7] += (a[(i + 7) * stride]);
        }

        /* accumulate now to avoid stack spills for single peel loop */
        res = ((r[0] + r[1]) + (r[2] + r[3])) +
              ((r[4] + r[5]) + (r[6] + r[7]));

        /* do non multiple of 8 rest */
        for (; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else {
        /* divide by two but avoid non-multiples of unroll factor */
        npy_uintp n2 = n / 2;
        n2 -= n2 % 8;
        return pairwise_sum_DOUBLE(a, n2, stride) +
               pairwise_sum_DOUBLE(a + n2 * stride, n - n2, stride);
    }
}
