Using maximum shared memory in Cuda

from here: Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific; as such, they must use dynamic shared memory (rather than statically sized arrays) and require … Read more
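A minimal sketch of opting in to the larger carve-out, assuming the usual pattern of `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` (the kernel name and sizes here are illustrative, not from the original answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dynamic shared memory: declared extern and unsized; the actual size
// is supplied as the third launch-configuration parameter.
__global__ void reverseKernel(float *data, int n) {
    extern __shared__ float smem[];
    int i = threadIdx.x;
    if (i < n) smem[i] = data[i];
    __syncthreads();
    if (i < n) data[i] = smem[n - 1 - i];
}

int main(void) {
    // Hypothetical request: 64 KB per block, above the 48 KB default cap.
    const size_t smemBytes = 64 * 1024;

    // Without this opt-in, launching with more than 48 KB of dynamic
    // shared memory fails on 7.x devices.
    cudaFuncSetAttribute(reverseKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemBytes);

    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    reverseKernel<<<1, n, smemBytes>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```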

How to create a CUDA context?

The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context. EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are … Read more
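The sequence described above can be sketched as a short host program (device 0 is just an example ID):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    // Select the target device BEFORE touching the runtime API,
    // so the context is established on the device you want.
    cudaSetDevice(0);

    // cudaFree(0) is a harmless no-op free whose only effect here is
    // to force lazy context creation on the selected device.
    cudaError_t err = cudaFree(0);
    printf("context init: %s\n", cudaGetErrorString(err));
    return 0;
}
```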

cudaMemset() – does it set bytes or integers?

The documentation is correct, and your interpretation of what cudaMemset does is wrong. The function really does set byte values. Your example sets the first 32 bytes to 0x12, not all 32 integers to 0x12, viz: #include <cstdio> int main(void) { const int n = 32; const size_t sz = size_t(n) * sizeof(int); int *dJunk; … Read more
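A self-contained sketch in the spirit of the truncated snippet (the exact continuation of the original code is not shown here), illustrating that the count argument is in bytes: setting 32 bytes to 0x12 fills only the first 8 ints, since 32 / sizeof(int) = 8.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);   // 128 bytes
    int *dJunk;
    cudaMalloc(&dJunk, sz);

    cudaMemset(dJunk, 0, sz);      // zero all 128 bytes
    cudaMemset(dJunk, 0x12, 32);   // set the first 32 BYTES to 0x12

    int aJunk[n];
    cudaMemcpy(aJunk, dJunk, sz, cudaMemcpyDeviceToHost);

    // The first 8 ints read 0x12121212; the remaining 24 are still 0.
    for (int i = 0; i < n; i++)
        printf("%2d: 0x%08x\n", i, aJunk[i]);

    cudaFree(dJunk);
    return 0;
}
```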

Timing CUDA operations

You could do something along the lines of: #include <sys/time.h> struct timeval t1, t2; gettimeofday(&t1, 0); kernel_call<<<dimGrid, dimBlock, 0>>>(); HANDLE_ERROR(cudaThreadSynchronize()); gettimeofday(&t2, 0); double time = (1000000.0*(t2.tv_sec-t1.tv_sec) + t2.tv_usec-t1.tv_usec)/1000.0; printf("Time to generate: %3.1f ms \n", time); or: float time; cudaEvent_t start, stop; HANDLE_ERROR( cudaEventCreate(&start) ); HANDLE_ERROR( cudaEventCreate(&stop) ); HANDLE_ERROR( cudaEventRecord(start, 0) ); kernel_call<<<dimGrid, dimBlock, 0>>>(); … Read more
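The event-based variant above is cut off; a plausible sketch of how such a measurement is usually completed (assuming the same `HANDLE_ERROR`, `kernel_call`, `dimGrid`, and `dimBlock` names from the excerpt) looks like this:

```cuda
// Assumed context: kernel_call, dimGrid, dimBlock, and a HANDLE_ERROR
// error-checking macro as in the original excerpt.
float time;
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );

HANDLE_ERROR( cudaEventRecord(start, 0) );
kernel_call<<<dimGrid, dimBlock, 0>>>();
HANDLE_ERROR( cudaEventRecord(stop, 0) );

// Events are recorded asynchronously; wait for the stop event before
// reading the elapsed time (reported in milliseconds).
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
printf("Time to generate: %3.1f ms\n", time);

HANDLE_ERROR( cudaEventDestroy(start) );
HANDLE_ERROR( cudaEventDestroy(stop) );
```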

In a CUDA kernel, how do I store an array in “local thread memory”?

Arrays, local memory and registers There is a misconception here regarding the definition of “local memory”. “Local memory” in CUDA is actually global memory (and should really be called “thread-local global memory”) with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread’s data blocked together). If you … Read more
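One way to illustrate the register-vs-local distinction sketched above (a hypothetical kernel, not from the original answer): a small array indexed only with compile-time constants can live entirely in registers, while an array indexed with a runtime value is typically spilled to thread-local global memory.

```cuda
__global__ void localVsRegisters(float *out) {
    // Small array, fully unrolled, all indices known at compile time:
    // the compiler can promote this to registers.
    float regArr[4];
    #pragma unroll
    for (int i = 0; i < 4; i++)
        regArr[i] = i * 2.0f;

    // Larger array indexed by a runtime-dependent value: the compiler
    // generally places this in "local" memory, which physically resides
    // in (thread-local, interleaved) global memory.
    float localArr[64];
    for (int i = 0; i < 64; i++)
        localArr[i] = threadIdx.x + i;

    int j = (threadIdx.x * 7) % 64;   // index unknown at compile time
    out[threadIdx.x] = regArr[3] + localArr[j];
}
```

Compiling with `nvcc -Xptxas -v` reports the per-thread local memory footprint, which is a quick way to check whether an array was spilled.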