Any particular function to initialize GPU other than the first cudaMalloc call?

A call to

cudaFree(0);

is the canonical way to force lazy context establishment in the CUDA runtime. You can’t reduce the overhead, that is a function of driver, runtime and operating system latencies. But the call above will let you control how/when those overheads occur during program execution.

EDIT in 2015 to add that the heuristics of context initialisation in the runtime API have subtly changed over time so that cudaSetDevice now establishes a context, so the cudaFree() call isn’t explicitly required to intialise a context, you can use cudaSetDeviceinstead. Also note that some set-up time will still be incurred at the first kernel launch, whereas before this wasn’t the case. For for kernel timing, it is best to include a warm-up call first before launching the kernel you will time to remove this set-up latency. It appears that the various profiling tools have enough granularity built in to avoid this without any extra API calls or kernel calls.

Leave a Comment