Trouble launching CUDA kernels from static initialization code

The short version:

The underlying reason for the problem when class A is instantiated outside of main is that a particular hook routine which is required to initialise the CUDA runtime library with your kernels is not being run before the constructor of class A is being called. This happens because there are no guarantees about the order in which static objects are instantiated and initialised in the C++ execution model. Your global scope class is being instantiated before the global scope objects which do the CUDA setup are initialised. Your kernel code is never being loaded into the context before it is call, and a runtime error results.

As best as I can tell, this is a genuine limitation of the CUDA runtime API and not something easily fixed in user code. In your trivial example, you could replace the kernel call with a call to cudaMemset or one of the non-symbol based runtime API memset functions and it will work. This problem is completely limited to user kernels or device symbols loaded at runtime via the runtime API. For this reason, an empty default constructor would also solve your problem. From a design point of view, I would be very dubious of any pattern which calls kernels in the constructor. Adding a specific method for class GPU setup/teardown which doesn’t rely on the default constructor or destructor would be a much cleaner and less error prone design, IMHO.

In detail:

There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error. This is a part of the “lazy” context initialisation feature of the runtime API. You can confirm this for yourself as follows:

Here is a gdb trace of the revised example you posted. Note I insert a breakpoint into __cudaRegisterFatBinary, and that isn’t reached before your static A constructor is called and the kernel launch fails:

talonmies@box:~$ gdb a.out 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary' 
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 10774)]
Class A
Kernel : invalid device function 
[Thread 0x7ffff5a63700 (LWP 10774) exited]
[Inferior 1 (process 10771) exited with code 0377]

Here is the same procedure, this time with A instantiation inside main (which is guaranteed to happen after the objects which perform lazy setup have been initialised):

talonmies@box:~$ cat main.cu
#include "classA.h"


int main() {
    A a_object;
    std::cout << "Main" << std::endl;
    return 0;
}

talonmies@box:~$ nvcc --keep -arch=sm_30 -g main.cu
talonmies@box:~$ gdb a.out 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary' 
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, 0x0000000000403180 in __cudaRegisterFatBinary ()
(gdb) cont
Continuing.
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 11084)]
Class A
Main
[Thread 0x7ffff5a63700 (LWP 11084) exited]
[Inferior 1 (process 11081) exited normally]

If this is really a crippling problem for you, I would suggest contacting NVIDIA developer support and raising a bug report.

Leave a Comment