Implementing a 64-bit atomic counter with 32-bit atomics

This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there’s only one writer so no extra support for excluding simultaneous writers is needed.)

You don’t need (or want) the increment of the counter variable itself to be an atomic RMW operation. (Unless you’re on a system that can do that cheaply with no extra cost, in which case you don’t need a SeqLock at all.) You can just load both halves with atomic 32-bit loads, increment, and atomically store the result (with cheap relaxed or release memory order, and a release store for the 2nd update of the sequence counter).

Similarly, the sequence counter doesn’t need an atomic RMW either: with a single writer, a pure load and a separate pure store are enough.

The writer only needs pure loads and pure stores with at most release ordering, which are (much) cheaper than atomic RMWs or seq_cst stores:

  • load the counter and the value in any order
  • store a new counter (old+1)
  • store the new value (or just update the low half if you want to branch on no carry)
  • store the final counter.

The ordering of the stores in the last 3 bullet points is the only thing that matters. A write fence after the first store could be good, because we don’t really want the cost of making the stores of both halves of the value release operations, on CPUs where that’s more expensive than relaxed.
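The writer steps above can be sketched like this (a minimal, untested sketch with hypothetical globals: the value split into two 32-bit halves `lo` and `hi`, plus the sequence counter `seq`; single writer only):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical globals for illustration: the sequence counter and the
// two 32-bit halves of the 64-bit value.
std::atomic<unsigned> seq{0};
std::atomic<uint32_t> lo{0}, hi{0};

// SINGLE WRITER only: pure loads and pure stores, no atomic RMW needed.
void writer_inc() {
    // load the counter and the value in any order
    unsigned c = seq.load(std::memory_order_relaxed);
    uint32_t l = lo.load(std::memory_order_relaxed);
    uint32_t h = hi.load(std::memory_order_relaxed);

    seq.store(c + 1, std::memory_order_relaxed);         // counter becomes odd
    std::atomic_thread_fence(std::memory_order_release); // before any data store

    uint32_t newl = l + 1;
    lo.store(newl, std::memory_order_relaxed);
    if (newl == 0)                 // branch on carry-out of the low half
        hi.store(h + 1, std::memory_order_relaxed);

    seq.store(c + 2, std::memory_order_release);         // counter even again
}
```

The release fence (rather than making both data stores release) matches the point above: only the ordering of the counter stores relative to the data stores matters.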


Unfortunately to satisfy C++ rules, the value has to be atomic<T>, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARM ldp / stp load-pair might not be atomic, but that doesn’t matter. (And compilers often won’t optimize two separate atomic 32-bit loads into one wider load.)

Values other threads read while the sequence-counter is odd are irrelevant, but we’d like to avoid undefined behaviour. Maybe we could use a union of a volatile uint64_t and an atomic<uint64_t>.
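A sketch of what that union idea might look like (reading the member you didn’t most recently write is type-punning, which ISO C++ leaves undefined but which GNU C++ supports as an extension):

```cpp
#include <atomic>
#include <cstdint>

// Sketch only: punning between the members is a GNU extension, not
// portable ISO C++. This just shows the shape of the idea.
union SharedVal {
    std::atomic<uint64_t> atom;
    volatile uint64_t plain;
    SharedVal() : atom(0) {}   // a union must explicitly pick an active member
};
```

The writer could store through whichever member generates the better asm, while the reader accesses the other; both views have the same object representation.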


I wrote this C++ SeqLock<class T> template for another question I didn’t finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store).

This tries to check if the target already supports lock-free atomic operations on atomic<T>, to stop you from using this when it’s pointless. (Disable that for testing purposes by defining IGNORE_SIZECHECK.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using a static_assert.

I provided an inc() function for types T that support a ++ operator. A TODO would be an apply() that accepts a lambda to do something to a T, and stores the result between the sequence counter updates.
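A free-function version of that apply() idea might look like this (a hypothetical, untested sketch, not part of the template below; same single-writer assumption):

```cpp
#include <atomic>

// Hypothetical sketch of the apply() TODO: run a caller-supplied function
// on a copy of the data, with the two counter updates around the store.
// SINGLE WRITER only, like the rest of this answer.
template <class T, class F>
T seqlock_apply(std::atomic<unsigned> &seqcount, volatile T &data, F func) {
    unsigned c = seqcount.load(std::memory_order_relaxed);
    seqcount.store(c + 1, std::memory_order_relaxed);    // counter becomes odd
    std::atomic_thread_fence(std::memory_order_release); // before the data store
    T tmp = data;        // copy out of the volatile object
    tmp = func(tmp);     // arbitrary modification, e.g. via a lambda
    data = tmp;          // store the result
    seqcount.store(c + 2, std::memory_order_release);    // counter even again
    return tmp;
}
```

inc() would then just be seqlock_apply with a lambda that increments its argument.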

// **UNTESTED**

#include <atomic>

#ifdef UNIPROCESSOR
// all readers and writers run on the same core (or same software thread)
// ordering instructions at compile time is all that's necessary
#define ATOMIC_FENCE std::atomic_signal_fence
#else
// A reader can be running on another core while writing.
// Memory barriers or ARMv8 acquire / release loads / store are needed
#define ATOMIC_FENCE std::atomic_thread_fence
#endif
// using fences instead of .store(std::memory_order_release) will stop the compiler
// from taking advantage of a release-store instruction instead of separate fence, like on AArch64
// But fences allow it to be optimized away to just compile-time ordering for the single thread or uniprocessor case.


// SINGLE WRITER only.
// uses volatile + barriers for the data itself, like pre-C++11
template <class T>
class SeqLocked
{
#ifndef IGNORE_SIZECHECK
    // sizeof(T) > sizeof(unsigned)
    static_assert(!std::atomic<T>::is_always_lock_free, "A Seq Lock with a type small enough to be atomic on its own is totally pointless, and we don't have a specialization that replaces it with a straight wrapper for atomic<T>");
#endif

    // C++17 doesn't have a good way to express a load that doesn't care about tearing
    //  without explicitly writing it as multiple small parts and thus gimping the compiler if it can use larger loads
    volatile T data;          // volatile should be fine on any implementation where pre-C++11 lockless code was possible with volatile,
                              //  even though Data Race UB does apply to volatile variables in ISO C++11 and later.
                              // even non-volatile normally works in practice, being ordered by compiler barriers.

    std::atomic<unsigned> seqcount{0};  // Even means valid, odd means modification in progress.
                                        //  unsigned definitely wraps around at a power of 2 on overflow

public:
    T get() const {
        unsigned c0, c1;
        T tmp;
        // READER RETRY LOOP
        do {
            c0 = seqcount.load(std::memory_order_acquire);     // or for your signal-handler use-case, relaxed load followed by ATOMIC_FENCE(std::memory_order_acquire);

            tmp = (T)data;       // load

            ATOMIC_FENCE(std::memory_order_acquire);  // LoadLoad barrier
            c1 = seqcount.load(std::memory_order_relaxed);
        } while(c0&1 || c0 != c1);     // retry if the counter changed or is odd

        return tmp;
    }

    // TODO: a version of this that takes a lambda for the operation on tmp
    T inc()     // WRITER
    {
        unsigned orig_count = seqcount.load(std::memory_order_relaxed);
        // we're the only writer, avoid an atomic RMW.

        seqcount.store(orig_count+1, std::memory_order_relaxed);
        ATOMIC_FENCE(std::memory_order_release);     // 2-way barrier *after* the store, not like a release store.  Or like making data=tmp a release operation.
        // make sure the counter becomes odd *before* any data change

        T tmp = data;  // load into a non-volatile temporary
        ++tmp;         // make any change to it
        data = tmp;    // store

        seqcount.store(orig_count+2, std::memory_order_release);  // or use ATOMIC_FENCE(std::memory_order_release); *before* this, so the UNIPROCESSOR case can just do compile-time ordering

        return tmp;
    }

    void set(T newval) {
        unsigned orig_count = seqcount.load(std::memory_order_relaxed);

        seqcount.store(orig_count+1, std::memory_order_relaxed);
        ATOMIC_FENCE(std::memory_order_release);
        // make sure the data stores appear after the first counter update.

        data = newval;    // store

        ATOMIC_FENCE(std::memory_order_release);
        seqcount.store(orig_count+2, std::memory_order_relaxed);  // Or use mo_release here, better on AArch64
    }

};


/***** test callers *******/
#include <stdint.h>

struct sixteenbyte {
    //unsigned arr[4];
    unsigned long  a,b,c,d;
    sixteenbyte() = default;
    sixteenbyte(const volatile sixteenbyte &old)
         : a(old.a), b(old.b), c(old.c), d(old.d) {}
    //arr(old.arr) {}
};

void test_inc(SeqLocked<uint64_t> &obj) {  obj.inc(); }
sixteenbyte test_get(SeqLocked<sixteenbyte> &obj) { return obj.get(); }
//void test_set(SeqLocked<sixteenbyte> &obj, sixteenbyte val) { obj.set(val); }

uint64_t test_get(SeqLocked<uint64_t> &obj) {
    return obj.get();
}

// void atomic_inc_u64_seq_cst(std::atomic<uint64_t> &a) { ++a; }
uint64_t u64_inc_relaxed(std::atomic<uint64_t> &a) {
    // same but without dmb barriers
    return 1 + a.fetch_add(1, std::memory_order_relaxed);
}

uint64_t u64_load_relaxed(std::atomic<uint64_t> &a) {
    // gcc uses LDREXD, not just LDRD?
    return a.load(std::memory_order_relaxed);
}

void u64_store_relaxed(std::atomic<uint64_t> &a, uint64_t val) {
    // gcc uses a LL/SC retry loop even for a pure store?
    a.store(val, std::memory_order_relaxed);
}

It compiles to the asm we want on the Godbolt compiler explorer for ARM and other ISAs. At least for int64_t; larger struct types may be copied less efficiently because of cumbersome volatile rules.

It uses non-atomic volatile T data for the shared data. This is technically data-race undefined behaviour, but all compilers we use in practice were fine with pre-C++11 multi-threaded access to volatile objects. And pre-C++11, people even depended on atomicity for some sizes. We do not: we check the counter, and only use the value we read if there were no concurrent writes. (That’s the whole point of a SeqLock.)

One problem with volatile T data is that in ISO C++, T foo = data won’t compile for struct objects unless you provide a copy-constructor from a volatile object, like

sixteenbyte(const volatile sixteenbyte &old)
         : a(old.a), b(old.b), c(old.c), d(old.d) {}

This is really annoying for us, because we don’t care about the details of how memory is read, just that multiple reads aren’t optimized into one.

volatile is really the wrong tool here, and plain T data with sufficient fencing to ensure that the read actually happens between the reads of the atomic counter would be better. e.g. we could do that in GNU C with an asm("":::"memory"); compiler barrier against reordering before/after the accesses. That would let the compiler copy larger objects with SIMD vectors, or whatever, which it won’t do with separate volatile accesses.
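In GNU C, such a reader could be sketched like this (hypothetical helper, untested like the rest: the compiler barriers pin the plain copy between the two counter loads without forcing narrow volatile accesses):

```cpp
#include <atomic>

// GNU C sketch: plain (non-volatile) data, ordered by compiler barriers.
// The asm statements stop the compiler from hoisting/sinking the copy
// across the seqcount loads, but let it copy T however it likes (e.g. SIMD).
template <class T>
T get_nonvolatile(const std::atomic<unsigned> &seqcount, const T &data) {
    unsigned c0, c1;
    T tmp;
    do {
        c0 = seqcount.load(std::memory_order_acquire);
        asm volatile("" ::: "memory");   // don't move the copy above this
        tmp = data;                      // plain copy, free to use wide loads
        asm volatile("" ::: "memory");   // keep the copy before the 2nd load
        c1 = seqcount.load(std::memory_order_relaxed);
    } while (c0 & 1 || c0 != c1);        // retry if changed or odd
    return tmp;
}
```

The asm statements are compile-time-only barriers, so this relies on the hardware’s ordering plus the acquire load of the counter, the same assumption the fence-based reader above makes.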

I think std::atomic_thread_fence(mo_acquire) would also be a sufficient barrier, but I’m not 100% sure.


In ISO C, you can copy a volatile aggregate (struct), and the compiler will emit whatever asm it normally would to copy that many bytes. But in C++, we can’t have nice things apparently.
