Using maximum shared memory in Cuda

from here: Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific; as such, they must use dynamic shared memory (rather than statically sized arrays) and require … Read more
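A minimal sketch of opting in to the larger carve-out, assuming the usual pattern of `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` (the kernel name and sizes here are illustrative, not from the original answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dynamic shared memory: declared extern and unsized; the actual size
// is supplied as the third launch-configuration parameter.
__global__ void reverseKernel(float *data, int n) {
    extern __shared__ float smem[];
    int i = threadIdx.x;
    if (i < n) smem[i] = data[i];
    __syncthreads();
    if (i < n) data[i] = smem[n - 1 - i];
}

int main(void) {
    // Hypothetical request: 64 KB per block, above the 48 KB default cap.
    const size_t smemBytes = 64 * 1024;

    // Without this opt-in, launching with more than 48 KB of dynamic
    // shared memory fails on 7.x devices.
    cudaFuncSetAttribute(reverseKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemBytes);

    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    reverseKernel<<<1, n, smemBytes>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```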

How to create a CUDA context?

The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context. EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are … Read more
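The sequence described above can be sketched as a short host program (device 0 is just an example ID):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    // Select the target device BEFORE touching the runtime API,
    // so the context is established on the device you want.
    cudaSetDevice(0);

    // cudaFree(0) is a harmless no-op free whose only effect here is
    // to force lazy context creation on the selected device.
    cudaError_t err = cudaFree(0);
    printf("context init: %s\n", cudaGetErrorString(err));
    return 0;
}
```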

cudaMemset() – does it set bytes or integers?

The documentation is correct, and your interpretation of what cudaMemset does is wrong. The function really does set byte values. Your example sets the first 32 bytes to 0x12, not all 32 integers to 0x12, viz: #include <cstdio> int main(void) { const int n = 32; const size_t sz = size_t(n) * sizeof(int); int *dJunk; … Read more
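A self-contained sketch in the spirit of the truncated snippet (the exact continuation of the original code is not shown here), illustrating that the count argument is in bytes: setting 32 bytes to 0x12 fills only the first 8 ints, since 32 / sizeof(int) = 8.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);   // 128 bytes
    int *dJunk;
    cudaMalloc(&dJunk, sz);

    cudaMemset(dJunk, 0, sz);      // zero all 128 bytes
    cudaMemset(dJunk, 0x12, 32);   // set the first 32 BYTES to 0x12

    int aJunk[n];
    cudaMemcpy(aJunk, dJunk, sz, cudaMemcpyDeviceToHost);

    // The first 8 ints read 0x12121212; the remaining 24 are still 0.
    for (int i = 0; i < n; i++)
        printf("%2d: 0x%08x\n", i, aJunk[i]);

    cudaFree(dJunk);
    return 0;
}
```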

Timing CUDA operations

You could do something along the lines of: #include <sys/time.h> struct timeval t1, t2; gettimeofday(&t1, 0); kernel_call<<<dimGrid, dimBlock, 0>>>(); HANDLE_ERROR(cudaThreadSynchronize()); gettimeofday(&t2, 0); double time = (1000000.0*(t2.tv_sec-t1.tv_sec) + t2.tv_usec-t1.tv_usec)/1000.0; printf("Time to generate: %3.1f ms \n", time); or: float time; cudaEvent_t start, stop; HANDLE_ERROR( cudaEventCreate(&start) ); HANDLE_ERROR( cudaEventCreate(&stop) ); HANDLE_ERROR( cudaEventRecord(start, 0) ); kernel_call<<<dimGrid, dimBlock, 0>>>(); … Read more
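The event-based variant above is cut off; a plausible sketch of how such a measurement is usually completed (assuming the same `HANDLE_ERROR`, `kernel_call`, `dimGrid`, and `dimBlock` names from the excerpt) looks like this:

```cuda
// Assumed context: kernel_call, dimGrid, dimBlock, and a HANDLE_ERROR
// error-checking macro as in the original excerpt.
float time;
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );

HANDLE_ERROR( cudaEventRecord(start, 0) );
kernel_call<<<dimGrid, dimBlock, 0>>>();
HANDLE_ERROR( cudaEventRecord(stop, 0) );

// Events are recorded asynchronously; wait for the stop event before
// reading the elapsed time (reported in milliseconds).
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
printf("Time to generate: %3.1f ms\n", time);

HANDLE_ERROR( cudaEventDestroy(start) );
HANDLE_ERROR( cudaEventDestroy(stop) );
```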

In a CUDA kernel, how do I store an array in “local thread memory”?

Arrays, local memory and registers There is a misconception here regarding the definition of “local memory”. “Local memory” in CUDA is actually global memory (and should really be called “thread-local global memory”) with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread’s data blocked together). If you … Read more
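One way to illustrate the register-vs-local distinction sketched above (a hypothetical kernel, not from the original answer): a small array indexed only with compile-time constants can live entirely in registers, while an array indexed with a runtime value is typically spilled to thread-local global memory.

```cuda
__global__ void localVsRegisters(float *out) {
    // Small array, fully unrolled, all indices known at compile time:
    // the compiler can promote this to registers.
    float regArr[4];
    #pragma unroll
    for (int i = 0; i < 4; i++)
        regArr[i] = i * 2.0f;

    // Larger array indexed by a runtime-dependent value: the compiler
    // generally places this in "local" memory, which physically resides
    // in (thread-local, interleaved) global memory.
    float localArr[64];
    for (int i = 0; i < 64; i++)
        localArr[i] = threadIdx.x + i;

    int j = (threadIdx.x * 7) % 64;   // index unknown at compile time
    out[threadIdx.x] = regArr[3] + localArr[j];
}
```

Compiling with `nvcc -Xptxas -v` reports the per-thread local memory footprint, which is a quick way to check whether an array was spilled.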