Most accurate way to do a combined multiply-and-divide operation in 64-bit?

Since this is tagged Visual C++ I’ll give a solution that abuses MSVC-specific intrinsics.

This example is fairly complicated. It’s a highly simplified version of the same algorithm that is used by GMP and java.math.BigInteger for large division.

Although I have a simpler algorithm in mind, it’s probably about 30x slower.

This solution has the following constraints/behavior:

  • It requires x64. It will not compile on x86.
  • The quotient is not zero.
  • The quotient saturates if it overflows 64-bits.

Note that this is for the unsigned integer case. It’s trivial to build a wrapper around this to make it work for signed cases as well. This example should also produce correctly truncated results.

This code is not fully tested. However, it has passed all the tests cases that I’ve thrown at it.
(Even cases that I’ve intentionally constructed to try to break the algorithm.)

#include <intrin.h>

uint64_t muldiv2(uint64_t a, uint64_t b, uint64_t c){
    //  Normalize divisor
    unsigned long shift;
    _BitScanReverse64(&shift,c);
    shift = 63 - shift;

    c <<= shift;

    //  Multiply
    a = _umul128(a,b,&b);
    if (((b << shift) >> shift) != b){
        cout << "Overflow" << endl;
        return 0xffffffffffffffff;
    }
    b = __shiftleft128(a,b,shift);
    a <<= shift;


    uint32_t div;
    uint32_t q0,q1;
    uint64_t t0,t1;

    //  1st Reduction
    div = (uint32_t)(c >> 32);
    t0 = b / div;
    if (t0 > 0xffffffff)
        t0 = 0xffffffff;
    q1 = (uint32_t)t0;
    while (1){
        t0 = _umul128(c,(uint64_t)q1 << 32,&t1);
        if (t1 < b || (t1 == b && t0 <= a))
            break;
        q1--;
//        cout << "correction 0" << endl;
    }
    b -= t1;
    if (t0 > a) b--;
    a -= t0;

    if (b > 0xffffffff){
        cout << "Overflow" << endl;
        return 0xffffffffffffffff;
    }

    //  2nd reduction
    t0 = ((b << 32) | (a >> 32)) / div;
    if (t0 > 0xffffffff)
        t0 = 0xffffffff;
    q0 = (uint32_t)t0;

    while (1){
        t0 = _umul128(c,q0,&t1);
        if (t1 < b || (t1 == b && t0 <= a))
            break;
        q0--;
//        cout << "correction 1" << endl;
    }

//    //  (a - t0) gives the modulus.
//    a -= t0;

    return ((uint64_t)q1 << 32) | q0;
}

Note that if you don’t need a perfectly truncated result, you can remove the last loop completely. If you do this, the answer will be no more than 2 larger than the correct quotient.

Test Cases:

cout << muldiv2(4984198405165151231,6132198419878046132,9156498145135109843) << endl;
cout << muldiv2(11540173641653250113, 10150593219136339683, 13592284235543989460) << endl;
cout << muldiv2(449033535071450778, 3155170653582908051, 4945421831474875872) << endl;
cout << muldiv2(303601908757, 829267376026, 659820219978) << endl;
cout << muldiv2(449033535071450778, 829267376026, 659820219978) << endl;
cout << muldiv2(1234568, 829267376026, 1) << endl;
cout << muldiv2(6991754535226557229, 7798003721120799096, 4923601287520449332) << endl;
cout << muldiv2(9223372036854775808, 2147483648, 18446744073709551615) << endl;
cout << muldiv2(9223372032559808512, 9223372036854775807, 9223372036854775807) << endl;
cout << muldiv2(9223372032559808512, 9223372036854775807, 12) << endl;
cout << muldiv2(18446744073709551615, 18446744073709551615, 9223372036854775808) << endl;

Output:

3337967539561099935
8618095846487663363
286482625873293138
381569328444
564348969767547451
1023786965885666768
11073546515850664288
1073741824
9223372032559808512
Overflow
18446744073709551615
Overflow
18446744073709551615

Leave a Comment