cuda - w3toppers.com

What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

Roughly speaking, the code compilation flow goes like this: CUDA C/C++ device code source –> PTX –> SASS The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is … Read more

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

Hardware If a GPU device has, for example, 4 multiprocessing units, and they can run 768 threads each: then at a given moment no more than 4*768 threads will be really running in parallel (if you planned more threads, they will be waiting their turn). Software threads are organized in blocks. A block is executed … Read more

How to verify CuDNN installation?

The installation of CuDNN is just copying some files. Hence to check if CuDNN is installed (and which version you have), you only need to check those files. Install CuDNN Step 1: Register an nvidia developer account and download cudnn here (about 80 MB). You might need nvcc –version to get your cuda version. Step … Read more

How do CUDA blocks/warps/threads map onto CUDA cores?

Two of the best references are NVIDIA Fermi Compute Architecture Whitepaper GF104 Reviews I’ll try to answer each of your questions. The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a … Read more

Passing Host Function as a function pointer in global OR device function in CUDA

Yes, for a GPU implementation of Calc, you should pass the GetInv as a __device__ function pointer. It is possible, here are some worked examples: Ex. 1 Ex. 2 Ex. 3 Most of the above examples demonstrate bringing the device function pointer all the way back to the host code. This may not be necessary … Read more

How to measure the inner kernel time in NVIDIA CUDA?

You can do something like this: __global__ void kernelSample(int *runtime) { // …. clock_t start_time = clock(); //some code here clock_t stop_time = clock(); // …. runtime[tidx] = (int)(stop_time – start_time); } Which gives the number of clock cycles between the two calls. Be a little careful though, the timer will overflow after a couple … Read more

Using GPU from a docker container?

Regan’s answer is great, but it’s a bit out of date, since the correct way to do this is avoid the lxc execution context as Docker has dropped LXC as the default execution context as of docker 0.9. Instead it’s better to tell docker about the nvidia devices via the –device flag, and just use … Read more

How can I implement a custom atomic function involving several variables?

As I stated in my second comment above, it’s possible to combine your two 32-bit quantities into a single 64-bit atomically managed quantity, and deal with the problem that way. We then manage the 64-bit quantity atomically using the arbitrary atomic example as a rough guide. Obviously you can’t extend this idea beyond two 32-bit … Read more

Modifying registry to increase GPU timeout, windows 7

The link in your post is correct, you just need to create the corresponding key with the desired value. You will find the TDR Registry Keys description here. The setting you are looking for is TdrDelay Specifies the number of seconds that the GPU can delay the preempt request from the GPU scheduler. This is … Read more

How is CUDA memory managed?

The device memory available to your code at runtime is basically calculated as Free memory = total memory – display driver reservations – CUDA driver reservations – CUDA context static allocations (local memory, constant memory, device code) – CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer … Read more