In a CUDA kernel, how do I store an array in “local thread memory”?

Question

Arrays, local memory and registers

There is a misconception here regarding the definition of “local memory”. “Local memory” in CUDA is actually global memory (and should really be called “thread-local global memory”) with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread’s data blocked together). If you want things to be really fast, you want to use either shared memory, or, better yet, registers (especially on the latest devices where you get up to 255 registers per thread). Explaining the entire CUDA memory hierarchy falls out of scope of this post. Let us instead focus on making small array computations fast.

Small arrays, just like variables can be stored entirely in registers. On current NVIDIA hardware, however, putting arrays into registers is difficult. Why? Because registers need very careful treatment. If you don’t do it exactly right, your data will end up in local memory (which, again, is really global memory, which is the slowest memory you have). The CUDA Programming Guide, section 5.3.2 tells you when local memory is used:

Local Memory

Local memory accesses only occur for some automatic variables as mentioned in Variable Type Qualifiers. Automatic variables that the compiler is likely to place in local memory are:

Arrays for which it cannot determine that they are indexed with constant quantities,

Large structures or arrays that would consume too much register space,

Any variable if the kernel uses more registers than available (this is also known as register spilling).

How does register allocation work?

Note that register allocation is an extremely complicated process which is why you cannot (and should not) interfere with it. Instead, the compiler will convert CUDA code into PTX code (a sort of bytecode) which assumes a machine with infinitely many registers. You can write inline PTX but it won’t do too much to register allocation. PTX code is device-independent code and it is only the first stage. In a second stage, PTX will be compiled into device assembly code, called SASS. SASS code has the actual register allocations. The SASS compiler and it’s optimizer will also be the ultimate authority on whether a variable will be in registers or local memory. All you can do is try to understand what the SASS compiler does in certain cases and use that for your advantage. Code correlation view in Nsight can help you with that (see below). However, since the compiler and optimizer keep changing, there is no guarantees as to what will or will not be in registers.

Insufficient registers

Appendix G, section 1 tells you how many registers a thread can have. Look for “Maximum number of 32-bit registers per thread”. In order to interpret that table, you must know your compute capability (see below). Don’t forget that registers are used for all kinds of things, and don’t just correlate to single variables. Registers on all devices up to CC 3.5 are 32 bit each. If the compiler is smart enough (and the CUDA compiler keeps changing), it can for example pack multiple bytes into the same register. The Nsight code correlation view (see “Analyzing Memory Accesses” below) also reveals that.

Constant vs. Dynamic Indexing

While the space constraint is an obvious hurdle to in-register arrays, the thing that is easily overseen is the fact that, on current hardware (Compute Capability 3.x and below), the compiler places any array in local memory that is accessed with dynamic indexing. A dynamic index is an index which the compiler cannot figure out. Arrays accessed with dynamic indices can’t be placed in registers because registers must be determined by the compiler, and thus the actual register being used must not depend on a value determined at run-time. For example, given an array arr, arr[k] is constant indexing if and only if k is a constant, or only depends on constants. If k, in any way, depends on some non-constant value, the compiler cannot compute the value of k and you got dynamic indexing. In loops where k starts and ends at a (small) constant numbers, the compiler (most probably) can unroll your loop, and can still achieve constant indexing.

Example

For example, sorting a small array can be done in registers but you must use sorting networks or similarly “hard-wired” approaches, and can’t just use a standard algorithm because most algorithms use dynamic indexing.

With quite a high probability, in the following code example, the compiler keeps the entire aBytes array in registers because it is not too large and the loops can fully be unrolled (because the loop iterates over a constant range). The compiler (very probably) knows which register is being accessed at every step and can thus keep it fully in registers. Keep in mind that there are no guarantees. The best you can do is to verify it on a case-by-case basis using CUDA developer tools, as described below.

__global__
void
testSortingNetwork4(const char * aInput, char * aResult)
{
    const int NBytes = 4;

    char aBytes[NBytes];

    // copy input to local array
    for (int i = 0; i < NBytes; ++i)
    {
        aBytes[i] = aInput[i];
    }

    // sort using sorting network
    CompareAndSwap(aBytes, 0, 2); CompareAndSwap(aBytes, 1, 3); 
    CompareAndSwap(aBytes, 0, 1); CompareAndSwap(aBytes, 2, 3); 
    CompareAndSwap(aBytes, 1, 2); 


    // copy back to result array
    for (int i = 0; i < NBytes; ++i)
    {
        aResult[i] = aBytes[i];
    }
}

Analyzing memory accesses

Once you are done, you generally want to verify whether the data is actually stored in registers or whether it went to local memory. The first thing you can do is to tell your compiler to give you memory statistics using the --ptxas-options=-v flag. A more detailed way of analyzing memory accesses is using Nsight.

Nsight has many cool features. Nsight for Visual Studio has a built-in profiler and a CUDA <-> SASS code correlation view. The feature is explained here. Note that Nsight versions for different IDEs are probably developed independently, and thus their features might vary between the different implementations.

If you follow the instructions in above link (make sure to add the corresponding flags when compiling!), you can find the “CUDA Memory Transactions” button at the very bottom of the lower menu. In that view, you want to find that there is no memory transaction coming from the lines that are only working on the corresponding array (e.g. the CompareAndSwap lines in my code example). Because if it does not report any memory access for those lines, you (very probably) were able to keep the entire computation in registers and might just have gained a speed up of thousands, if not tenthousands, of percent (You might also want to check the actual speed gain, you get out of this!).

Figuring out Compute Capability

In order to figure out how many registers you have, you need to know your device’s compute capability. The standard way of getting such device information is running the deviceQuery sample.

(Update – as mentioned by paleonix in the comments) deviceQuery is part of the official cuda-samples repo. You can find it here.

If you have Nsight for Visual Studio, just go to Nsight -> Windows -> System Info.

Don’t optimize early

I am sharing this today because I came across this very problem very recently. However, as mentioned in this thread, forcing data to be in registers is definitely not the first step you want to take. First, make sure that you actually understand what is going on, then approach the problem step by step. Looking at the assembly code is certainly a good step, but it should generally not be your first. If you are new to CUDA, the CUDA Best Practices Guide will help you figure out some of those steps.