CUDA stream compaction algorithm

What you are asking for is a classic parallel algorithm called stream compaction. If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm: it preserves the relative order of all elements. Rough sketch:

    #include <thrust/copy.h>

    template<typename T>
    struct is_non_zero {
        __host__ __device__ auto operator()(T x) const -> bool { return x …

Read more
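Filling the sketch out into a complete program might look like the following; the predicate name matches the excerpt, but the input data and sizes are illustrative, not from the original answer.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Predicate that keeps non-zero elements.
template <typename T>
struct is_non_zero {
    __host__ __device__ bool operator()(T x) const { return x != T(0); }
};

int main() {
    // Illustrative input with interspersed zeros.
    int h_in[] = {0, 3, 0, 0, 7, 1, 0, 4};
    thrust::device_vector<int> in(h_in, h_in + 8);
    thrust::device_vector<int> out(in.size());

    // copy_if is stable: the surviving elements {3, 7, 1, 4}
    // keep their original relative order.
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(),
                               is_non_zero<int>());
    out.resize(end - out.begin());
    return 0;
}
```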

How to get VS 2010 to recognize certain CUDA functions

You could create a dummy #include file of the following form:

    #pragma once
    #ifdef __INTELLISENSE__
    void __syncthreads();
    …
    #endif

This should hide the fake prototypes from the CUDA and Visual C++ compilers, but still make them visible to IntelliSense. Source for the __INTELLISENSE__ macro: http://blogs.msdn.com/b/vcblog/archive/2011/03/29/10146895.aspx
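Fleshed out, such a header might look like the sketch below; the filename and the particular set of stubbed intrinsics are illustrative, not prescribed by the answer.

```cuda
// intellisense_stubs.h (hypothetical filename)
#pragma once

#ifdef __INTELLISENSE__
// __INTELLISENSE__ is defined only by the IntelliSense parser,
// so nvcc and cl.exe never see these fake prototypes.
void __syncthreads();
void __threadfence();
int  atomicAdd(int *address, int val);
#endif
```

Including this header at the top of a .cu file silences the "identifier is undefined" squiggles without affecting compilation.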

Allocate 2D Array on Device Memory in CUDA

I found a solution to this problem. I didn't have to flatten the array. The built-in cudaMallocPitch() function did the job, and I could transfer the array to and from the device using the cudaMemcpy2D() function. For example:

    cudaMallocPitch((void**) &array, &pitch, a*sizeof(float), b);

This creates a 2D array of size a*b with the pitch as passed in … Read more
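A minimal end-to-end sketch of the pattern, assuming illustrative dimensions a and b and omitting error checking for brevity:

```cuda
#include <cuda_runtime.h>

int main() {
    const int a = 64, b = 32;       // a floats per row, b rows
    float *d_array;
    size_t pitch;

    // Each of the b rows is at least a*sizeof(float) bytes wide,
    // padded out to 'pitch' bytes for aligned row starts.
    cudaMallocPitch((void**)&d_array, &pitch, a * sizeof(float), b);

    // Tightly packed host array.
    float *h_array = new float[a * b]();

    // Host -> device: source pitch is the tight row width.
    cudaMemcpy2D(d_array, pitch, h_array, a * sizeof(float),
                 a * sizeof(float), b, cudaMemcpyHostToDevice);

    // In a kernel, row r begins at:
    //   (float*)((char*)d_array + r * pitch)

    // Device -> host.
    cudaMemcpy2D(h_array, a * sizeof(float), d_array, pitch,
                 a * sizeof(float), b, cudaMemcpyDeviceToHost);

    cudaFree(d_array);
    delete[] h_array;
    return 0;
}
```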

CUDA determining threads per block, blocks per grid

In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size. A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), … Read more
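Since CUDA 6.5, the runtime can do this sizing for you: cudaOccupancyMaxPotentialBlockSize picks the block size that maximizes occupancy for a given kernel, accounting for its register and shared-memory usage. A sketch, with a hypothetical saxpy kernel:

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    int blockSize = 0, minGridSize = 0;

    // Ask the runtime for the occupancy-maximizing block size
    // for this particular kernel (0 = no dynamic shared memory,
    // 0 = no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       saxpy, 0, 0);

    // Then size the grid to cover the data.
    int gridSize = (n + blockSize - 1) / blockSize;
    // Launch would be: saxpy<<<gridSize, blockSize>>>(n, 2.0f, x, y);
    (void)gridSize;
    return 0;
}
```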

Cuda Mutex, why deadlock?

There are other questions here on mutexes; you might want to look at some of them. Search on "cuda critical section", for example. Assuming that one approach will work and another won't just because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in … Read more
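One commonly cited reason such mutexes deadlock is intra-warp contention: threads in a warp execute in lockstep, so if one lane holds the lock while sibling lanes spin on it, the warp can hang (especially on pre-Volta hardware). A common workaround, sketched below under that assumption, is to negotiate the lock at block granularity so no two threads in the same warp ever contend:

```cuda
#include <cuda_runtime.h>

__device__ int mutex = 0;   // 0 = unlocked, 1 = locked

__global__ void block_critical_section(int *counter) {
    // Only one representative thread per block takes the lock,
    // avoiding intra-warp spin-while-holding deadlock.
    if (threadIdx.x == 0) {
        while (atomicCAS(&mutex, 0, 1) != 0) { }  // acquire
        // ---- critical section ----
        *counter += 1;
        // ---- end critical section ----
        __threadfence();          // publish the update
        atomicExch(&mutex, 0);    // release
    }
    __syncthreads();
}
```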