gpgpu - w3toppers.com

How do CUDA blocks/warps/threads map onto CUDA cores?

Two of the best references are the NVIDIA Fermi Compute Architecture Whitepaper and the GF104 reviews. I’ll try to answer each of your questions. The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a …
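
The threads, blocks, and grids split described above can be illustrated with a little host-side arithmetic. This is only a sketch: `LaunchShape`, `launch_shape`, `global_index`, `warp_in_block`, and `lane_in_warp` are hypothetical names made up for illustration and are not part of the CUDA API, and a warp size of 32 is assumed (that is the value on NVIDIA GPUs to date).

```cpp
#include <cstddef>

// Assumption: 32 threads per warp (device code exposes this as warpSize).
const std::size_t kWarpSize = 32;

// Hypothetical helper type: how n work items split into blocks and warps.
struct LaunchShape {
    std::size_t blocks;           // thread blocks in the (1D) grid
    std::size_t warps_per_block;  // warps each block is divided into
    std::size_t idle_threads;     // threads launched beyond n in the last block
};

// Integer division rounded up.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Compute the 1D launch shape for n work items.
inline LaunchShape launch_shape(std::size_t n, std::size_t threads_per_block) {
    LaunchShape s;
    s.blocks = ceil_div(n, threads_per_block);
    s.warps_per_block = ceil_div(threads_per_block, kWarpSize);
    s.idle_threads = s.blocks * threads_per_block - n;
    return s;
}

// Mirrors the usual device-side expression
// blockIdx.x * blockDim.x + threadIdx.x for a 1D launch.
inline std::size_t global_index(std::size_t block, std::size_t threads_per_block,
                                std::size_t thread) {
    return block * threads_per_block + thread;
}

// Which warp of its block a thread falls into, and its lane within that warp.
inline std::size_t warp_in_block(std::size_t thread) { return thread / kWarpSize; }
inline std::size_t lane_in_warp(std::size_t thread) { return thread % kWarpSize; }
```

For example, 1000 work items at 256 threads per block come out as 4 blocks of 8 warps each, with 24 idle threads in the last block; a kernel would typically guard those with a bounds check against n.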

Passing Host Function as a function pointer in global OR device function in CUDA

Yes, for a GPU implementation of Calc, you should pass GetInv as a __device__ function pointer. It is possible; here are some worked examples: Ex. 1, Ex. 2, Ex. 3. Most of the above examples demonstrate bringing the device function pointer all the way back to the host code. This may not be necessary …

How to measure the inner kernel time in NVIDIA CUDA?

You can do something like this:

    __global__ void kernelSample(int *runtime)
    {
        // ...
        clock_t start_time = clock();
        // some code here
        clock_t stop_time = clock();
        // ...
        runtime[tidx] = (int)(stop_time - start_time);
    }

This gives the number of clock cycles between the two calls. Be a little careful though: the timer will overflow after a couple …

sending 3d array to CUDA kernel

First of all, I think talonmies, when he posted the response to the previous question you mention, was not intending it to be representative of good coding. So figuring out how to extend it to 3D might not be the best use of your time. For example, why do we want to write programs which …

Modifying registry to increase GPU timeout, windows 7

The link in your post is correct; you just need to create the corresponding key with the desired value. You will find the TDR Registry Keys description here. The setting you are looking for is TdrDelay: "Specifies the number of seconds that the GPU can delay the preempt request from the GPU scheduler." This is …

Utilizing the GPU with c# [closed]

[Edit OCT 2017 as even this answer gets quite old] Most of these answers are quite old, so I thought I’d give an updated summary of where I think each project is: GPU.Net (TidePowerd) – I tried this 6 months ago or so, and did get it working though it took a little bit of …

CUDA limit seems to be reached, but what limit is that?

The resource which is being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer which will kill any kernel which takes more than a few seconds to execute. Running code on a card which is running a display is subject to this limit. On the WDDM Windows platforms you …

Fastest sort of fixed length 6 int array

For any optimization, it’s always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I’d put my money on insertion sort based on past experience. Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, …
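
To make the two candidates mentioned above concrete, here is a minimal sketch of each for exactly six ints. The function names (`compare_swap`, `sort6_network`, `sort6_insertion`) are hypothetical, and the 12-comparator network is one commonly used layout chosen for illustration, not necessarily the one the answer had in mind. Nothing here says which is faster; that still has to be measured on your own data and hardware.

```cpp
#include <utility>

// Compare-exchange: afterwards d[i] <= d[j].
inline void compare_swap(int* d, int i, int j) {
    if (d[j] < d[i]) std::swap(d[i], d[j]);
}

// Sorting network for exactly 6 elements using 12 compare-exchanges.
// The sequence of comparisons is fixed and does not depend on the data.
inline void sort6_network(int* d) {
    // Sort the first triple (0,1,2).
    compare_swap(d, 1, 2); compare_swap(d, 0, 2); compare_swap(d, 0, 1);
    // Sort the second triple (3,4,5).
    compare_swap(d, 4, 5); compare_swap(d, 3, 5); compare_swap(d, 3, 4);
    // Merge the two sorted triples.
    compare_swap(d, 0, 3); compare_swap(d, 1, 4); compare_swap(d, 2, 5);
    compare_swap(d, 2, 4); compare_swap(d, 1, 3); compare_swap(d, 2, 3);
}

// Plain insertion sort specialised to 6 elements.
// The amount of work depends on how sorted the input already is.
inline void sort6_insertion(int* d) {
    for (int i = 1; i < 6; ++i) {
        int key = d[i];
        int j = i - 1;
        // Shift larger elements one slot right to make room for key.
        while (j >= 0 && d[j] > key) {
            d[j + 1] = d[j];
            --j;
        }
        d[j + 1] = key;
    }
}
```

The design difference worth testing is that the network always performs the same 12 comparisons, while insertion sort does less work on nearly sorted input and more on reversed input.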