What kind of variables consume registers in CUDA?

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment form, meaning that each register in PTX is only used once. A piece of PTX with hundreds of registers can compile into a kernel with only a few registers.

Register assignment is done by ptxas as a completely standalone compilation pass (either statically or just-in-time by the driver, or both) and it can perform a lot of code reordering and optimisations on the input PTX to improve throughput and conserve registers, meaning that there is little or no relationship between the variables in the original C or registers in PTX and the final register count of the assembled kernel.

nvcc does provide some ways to influence the register allocation behaviour of the assembler. You have __launch_bounds__ to provide heuristic hints to the compiler which can influence register allocation, and the compiler/assembler takes the -maxrregcount argument (at the potential expense of register spilling to local memory, which can lower performance). The volatile keyword used to make a difference to older versions of the nvopen64 based compiler and could influence the local memory spill behaviour. But you can’t arbitrarily control or steer register allocation in the original C code or PTX assembly language code.

Leave a Comment