Efficient 128-bit addition using carry flag

Actually gcc will use the carry automatically if you write your code carefully…

Current GCC can optimize hiWord += (loWord < loAdd); into add/adc (x86’s add-with-carry). This optimization was introduced in GCC5.3.

With separate uint64_t chunks in 64-bit mode: https://godbolt.org/z/S2kGRz.
And the same thing in 32-bit mode with uint32_t chunks: https://godbolt.org/z/9FC9vc

(editor’s note: Of course the hard part is writing a correct full-adder with carry in and carry out; that’s hard in C and GCC doesn’t know how to optimize any that I’ve seen.)

Also related: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html can give you carry-out from unsigned, or signed-overflow detection.

Older GCC, like GCC4.5, will branch or setc on the carry-out from an add, instead of using adc, and only used adc (add-with-carry) on the flag-result from an add if you used __int128. (Or uint64_t on a 32-bit target). See Is there a 128 bit integer in gcc? – only on 64-bit targets, supported since GCC4.1.

I compiled this code with gcc -O2 -Wall -Werror -S:

void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry                                                                                                             
    hiWord += hiAdd;
}

void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    hiWord += hiAdd;
    hiWord += (loWord < loAdd); // test_and_add_carry                                                                                                               
}

This is the assembly for increment128_1:

.cfi_startproc
        movabsq     $-8801131483544218438, %rax
        addq        (%rsi), %rax
        movabsq     $-8801131483544218439, %rdx
        cmpq        %rdx, %rax
        movq        %rax, (%rsi)
        ja  .L5
        movq        (%rdi), %rax
        addq        $1, %rax
.L3:
        movabsq     $6794178679361, %rdx
        addq        %rdx, %rax
        movq        %rax, (%rdi)
        ret

…and this is the assembly for increment128_2:

        movabsq     $-8801131483544218438, %rax
        addq        %rax, (%rsi)
        movabsq     $6794178679361, %rax
        addq        (%rdi), %rax
        movabsq     $-8801131483544218439, %rdx
        movq        %rax, (%rdi)
        cmpq        %rdx, (%rsi)
        setbe       %dl
        movzbl      %dl, %edx
        leaq        (%rdx,%rax), %rax
        movq        %rax, (%rdi)
        ret

Note the lack of conditional branches in the second version.

[edit]

Also, references are often bad for performance, because GCC has to worry about aliasing… It is often better to just pass things by value. Consider:

struct my_uint128_t {
    unsigned long hi;
    unsigned long lo;
};

my_uint128_t increment128_3(my_uint128_t x)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);
    return x;
}

Assembly:

        .cfi_startproc
        movabsq     $-8801131483544218438, %rdx
        movabsq     $-8801131483544218439, %rax
        movabsq     $6794178679362, %rcx
        addq        %rsi, %rdx
        cmpq        %rdx, %rax
        sbbq        %rax, %rax
        addq        %rcx, %rax
        addq        %rdi, %rax
        ret

This is actually the tightest code of the three.

…OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).

[edit 2]

And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?

typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));

my_uint128_t increment128_4(my_uint128_t x)
{
    const my_uint128_t hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    return x + (hiAdd << 64) + loAdd;
}

The assembly for this one is about as good as it gets:

        .cfi_startproc
        movabsq     $-8801131483544218438, %rax
        movabsq     $6794178679361, %rdx
        pushq       %rbx
        .cfi_def_cfa_offset 16
        addq        %rdi, %rax
        adcq        %rsi, %rdx
        popq        %rbx
        .cfi_offset 3, -16
        .cfi_def_cfa_offset 8
        ret

(Not sure where the push/pop of ebx came from, but this is still not bad.)

All of these are with GCC 4.5.2, by the way.

More Related Contents:

Leave a Comment Cancel reply