Optimizing member variable order in C++

Two issues here:

  • Whether and when keeping certain fields together is an optimization.
  • How to actually do it.

The reason it might help is that memory is loaded into the CPU cache in chunks called “cache lines”. This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more cache lines your object needs, the more other data gets evicted from the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It’s a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
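As a rough sketch of what grouping the commonly-used fields looks like, here is a made-up Particle type. The type, its field names, and the 64-byte line size are all assumptions for illustration; 64 bytes is common on x86-64 but you would need to check your own target. The static_assert only checks that the hot fields occupy a span smaller than one assumed line. Whether they really land in a single line also depends on where the object starts in memory.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Assumption: 64-byte cache lines. Not universal, check your target.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical object: position and velocity are touched on every update,
// the remaining fields only occasionally.
struct Particle {
    // Commonly-used ("hot") fields, kept together at the start.
    float x, y, z;
    float vx, vy, vz;

    // Rarely-used ("cold") fields.
    char debug_name[48];
    int  spawn_frame;
    int  owner_id;
};

// Number of bytes from the start of the object to the end of the last hot field.
constexpr std::size_t kHotBytes = offsetof(Particle, vz) + sizeof(float);

// Fails to compile if the hot fields grow past one assumed cache line.
static_assert(kHotBytes <= kAssumedCacheLine,
              "hot fields no longer fit in one assumed cache line");
```

If the cold fields had been declared first, an update loop would be likely to pull in a second line per object just to reach the velocity.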

The general principle is called “locality of reference”. The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It’s often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don’t know what’s going to be in the cache, etc. But it’s possible to talk about what’s likely to happen, most of the time. If you want to know anything for certain, you generally have to measure it.

Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++11 generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised, because they all lock the same cache line even though they’re operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day; the effect is usually called “false sharing”. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to help, on the grounds that cache misses are usually a source of lost speed, and still be horribly wrong in your particular case.
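One common way to keep fields that are written by different threads apart is to over-align them. The sketch below assumes 64-byte cache lines again, and the struct names are invented for the example. It only shows the layout; whether it actually helps in a given program is something you would have to measure.

```cpp
#include <atomic>
#include <cstddef>  // std::size_t

// Assumption: 64-byte cache lines. Not universal, check your target.
constexpr std::size_t kAssumedCacheLine = 64;

// Two counters declared next to each other. They will very likely share a
// cache line, so threads updating a and b separately can still contend.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same counters, each forced onto its own assumed cache line.
// This trades memory (and locality) for less contention between cores.
struct PaddedCounters {
    alignas(kAssumedCacheLine) std::atomic<long> a{0};
    alignas(kAssumedCacheLine) std::atomic<long> b{0};
};

static_assert(alignof(PaddedCounters) == kAssumedCacheLine,
              "the struct should inherit the over-alignment of its members");
static_assert(sizeof(PaddedCounters) == 2 * kAssumedCacheLine,
              "each counter should occupy its own assumed cache line");
```

Note that this is the opposite of packing: PaddedCounters is deliberately much bigger than PackedCounters, so it only makes sense for the few fields that really are contended.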

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don’t have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order in which they are declared. This is to make low-level programming easier. So, if your object contains:

  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn’t the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object’s size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That’s all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes: 10 bytes of fields, rounded up to a multiple of 4, because char has no alignment requirement and the two chars can sit next to each other.
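You can ask the compiler what it actually did rather than trusting the arithmetic. This is a small sketch with made-up struct names; print_layouts is just a helper for the example. On a platform where int is 4 bytes and 4-aligned it should report sizes of 16 and 12, matching the numbers above, but other platforms may differ.

```cpp
#include <cstddef>  // offsetof
#include <cstdio>   // std::printf

// Declared in the order from the example: int, char, int, char.
struct Interleaved {
    int  a;
    char b;
    int  c;
    char d;
};

// Same fields, with the two ints first and the two chars together.
struct Grouped {
    int  a;
    int  c;
    char b;
    char d;
};

// Prints the size of each struct and where each field ended up.
void print_layouts() {
    std::printf("Interleaved: size %zu, offsets a=%zu b=%zu c=%zu d=%zu\n",
                sizeof(Interleaved),
                offsetof(Interleaved, a), offsetof(Interleaved, b),
                offsetof(Interleaved, c), offsetof(Interleaved, d));
    std::printf("Grouped:     size %zu, offsets a=%zu c=%zu b=%zu d=%zu\n",
                sizeof(Grouped),
                offsetof(Grouped, a), offsetof(Grouped, c),
                offsetof(Grouped, b), offsetof(Grouped, d));
}
```

Call print_layouts() from main on each platform you care about; the gaps between consecutive offsets show where the padding went.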

I said that whether int is 4-aligned is platform-dependent: on some ARM cores it absolutely has to be, since an unaligned access can raise a hardware exception. On x86 you can access ints unaligned, but it’s generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next biggest, and so on down to members with no alignment requirement. For example if I’m trying to write portable code I might come up with this:

#include <stdint.h> // uint64_t, uint32_t, int32_t

struct some_stuff {
    double d;   // I expect double is 64bit IEEE, it might not be
    uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i; // 4 bytes, usually 4-aligned
    int32_t j;  // same
    short s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];  // array 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;     // 1 byte, any alignment
};
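To check whether an ordering like this really avoided padding, one option is to compare the size of the struct with the sum of its member sizes. The sketch below repeats the struct so that it stands alone; kMemberBytes and kPaddingBytes are names made up for the example. If the ordering worked, the only padding left should be the rounding at the end, which is always less than the struct’s alignment.

```cpp
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint64_t, std::uint32_t, std::int32_t

struct some_stuff {
    double        d;
    std::uint64_t l;
    std::uint32_t i;
    std::int32_t  j;
    short         s;
    char          c[4];
    char          e;  // called e here so it does not clash with the double d
};

// Bytes the members need if nothing at all were wasted.
constexpr std::size_t kMemberBytes =
    sizeof(double) + sizeof(std::uint64_t) + sizeof(std::uint32_t) +
    sizeof(std::int32_t) + sizeof(short) + sizeof(char[4]) + sizeof(char);

// Whatever is left over is padding added by the compiler.
constexpr std::size_t kPaddingBytes = sizeof(some_stuff) - kMemberBytes;

// Fires on a platform where the guessed ordering left gaps between members.
static_assert(kPaddingBytes < alignof(some_stuff),
              "more padding than tail rounding alone can explain");
```

On a platform where my guesses about alignment are wrong, the static_assert fails to compile, which is about as early as you can find out.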

If you don’t know the alignment of a field, or you’re writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it’s 8-aligned. Sometimes you’ll be wrong, but you’ll be right a lot of the time.
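If you would rather see how good the “alignment equals size” guess is on a particular compiler, alignof will tell you. This is only a sketch; report and report_fundamental_types are helper names invented for the example, and the list of types is just a sample.

```cpp
#include <cstdint>  // std::uint32_t, std::uint64_t
#include <cstdio>   // std::printf

// Prints size and alignment of T, and flags types where the
// "alignment equals size" guess would be wrong on this platform.
template <typename T>
void report(const char* name) {
    std::printf("%-12s size %2zu  align %2zu%s\n",
                name, sizeof(T), alignof(T),
                alignof(T) == sizeof(T) ? "" : "  (guess would be wrong)");
}

// Runs the report over a sample of fundamental types.
void report_fundamental_types() {
    report<char>("char");
    report<short>("short");
    report<int>("int");
    report<long>("long");
    report<long long>("long long");
    report<std::uint32_t>("uint32_t");
    report<std::uint64_t>("uint64_t");
    report<float>("float");
    report<double>("double");
    report<long double>("long double");
    report<void*>("void*");
}
```

Call report_fundamental_types() from main. The size of a type is always a multiple of its alignment, so the guess can only ever overestimate the alignment, never underestimate it.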

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don’t have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it’s not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won’t make the code unclear, “put commonly-used fields at the start of the object” and “sort by alignment requirement” are two good rules.
