CUDA stream compaction algorithm

What you are asking for is a classic parallel algorithm called stream compaction. If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm: it preserves the relative order of all elements. Rough sketch:

    #include <thrust/copy.h>

    template<typename T>
    struct is_non_zero {
        __host__ __device__ auto operator()(T x) const -> bool { return x …

Read more
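Filling the sketch out into a complete program might look like the following; the predicate name matches the excerpt, but the input data and sizes are illustrative, not from the original answer.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Predicate that keeps non-zero elements.
template <typename T>
struct is_non_zero {
    __host__ __device__ bool operator()(T x) const { return x != T(0); }
};

int main() {
    // Illustrative input with interspersed zeros.
    int h_in[] = {0, 3, 0, 0, 7, 1, 0, 4};
    thrust::device_vector<int> in(h_in, h_in + 8);
    thrust::device_vector<int> out(in.size());

    // copy_if is stable: the surviving elements {3, 7, 1, 4}
    // keep their original relative order.
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(),
                               is_non_zero<int>());
    out.resize(end - out.begin());
    return 0;
}
```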

How to get VS 2010 to recognize certain CUDA functions

You could create a dummy #include file of the following form:

    #pragma once
    #ifdef __INTELLISENSE__
    void __syncthreads();
    …
    #endif

This should hide the fake prototypes from the CUDA and Visual C++ compilers, but still make them visible to IntelliSense. Source for the __INTELLISENSE__ macro: http://blogs.msdn.com/b/vcblog/archive/2011/03/29/10146895.aspx
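Fleshed out, such a header might look like the sketch below; the filename and the particular set of stubbed intrinsics are illustrative, not prescribed by the answer.

```cuda
// intellisense_stubs.h (hypothetical filename)
#pragma once

#ifdef __INTELLISENSE__
// __INTELLISENSE__ is defined only by the IntelliSense parser,
// so nvcc and cl.exe never see these fake prototypes.
void __syncthreads();
void __threadfence();
int  atomicAdd(int *address, int val);
#endif
```

Including this header at the top of a .cu file silences the "identifier is undefined" squiggles without affecting compilation.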

Allocate 2D Array on Device Memory in CUDA

I found a solution to this problem. I didn't have to flatten the array. The built-in cudaMallocPitch() function did the job, and I could transfer the array to and from the device using the cudaMemcpy2D() function. For example:

    cudaMallocPitch((void**) &array, &pitch, a*sizeof(float), b);

This creates a 2D array of size a*b with the pitch as passed in … Read more
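A minimal end-to-end sketch of the pattern, assuming illustrative dimensions a and b and omitting error checking for brevity:

```cuda
#include <cuda_runtime.h>

int main() {
    const int a = 64, b = 32;       // a floats per row, b rows
    float *d_array;
    size_t pitch;

    // Each of the b rows is at least a*sizeof(float) bytes wide,
    // padded out to 'pitch' bytes for aligned row starts.
    cudaMallocPitch((void**)&d_array, &pitch, a * sizeof(float), b);

    // Tightly packed host array.
    float *h_array = new float[a * b]();

    // Host -> device: source pitch is the tight row width.
    cudaMemcpy2D(d_array, pitch, h_array, a * sizeof(float),
                 a * sizeof(float), b, cudaMemcpyHostToDevice);

    // In a kernel, row r begins at:
    //   (float*)((char*)d_array + r * pitch)

    // Device -> host.
    cudaMemcpy2D(h_array, a * sizeof(float), d_array, pitch,
                 a * sizeof(float), b, cudaMemcpyDeviceToHost);

    cudaFree(d_array);
    delete[] h_array;
    return 0;
}
```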

CUDA determining threads per block, blocks per grid

In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size. A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), … Read more
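Since CUDA 6.5, the runtime can do this sizing for you: cudaOccupancyMaxPotentialBlockSize picks the block size that maximizes occupancy for a given kernel, accounting for its register and shared-memory usage. A sketch, with a hypothetical saxpy kernel:

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    int blockSize = 0, minGridSize = 0;

    // Ask the runtime for the occupancy-maximizing block size
    // for this particular kernel (0 = no dynamic shared memory,
    // 0 = no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       saxpy, 0, 0);

    // Then size the grid to cover the data.
    int gridSize = (n + blockSize - 1) / blockSize;
    // Launch would be: saxpy<<<gridSize, blockSize>>>(n, 2.0f, x, y);
    (void)gridSize;
    return 0;
}
```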

Cuda Mutex, why deadlock?

There are other questions here on mutexes; you might want to look at some of them. Search on "cuda critical section", for example. Assuming that one approach will work and another won't just because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in … Read more
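One commonly cited reason such mutexes deadlock is intra-warp contention: threads in a warp execute in lockstep, so if one lane holds the lock while sibling lanes spin on it, the warp can hang (especially on pre-Volta hardware). A common workaround, sketched below under that assumption, is to negotiate the lock at block granularity so no two threads in the same warp ever contend:

```cuda
#include <cuda_runtime.h>

__device__ int mutex = 0;   // 0 = unlocked, 1 = locked

__global__ void block_critical_section(int *counter) {
    // Only one representative thread per block takes the lock,
    // avoiding intra-warp spin-while-holding deadlock.
    if (threadIdx.x == 0) {
        while (atomicCAS(&mutex, 0, 1) != 0) { }  // acquire
        // ---- critical section ----
        *counter += 1;
        // ---- end critical section ----
        __threadfence();          // publish the update
        atomicExch(&mutex, 0);    // release
    }
    __syncthreads();
}
```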