gpgpu - w3toppers.com

Best approach for GPGPU/CUDA/OpenCL in Java?

AFAIK, JavaCL / OpenCL4Java is the only OpenCL binding that is available on all platforms right now (including MacOS X, FreeBSD, Linux, Windows, Solaris, all in Intel 32, 64 bits and ppc variants, thanks to its use of JNA). It has demos that actually run fine from Java Web Start at least on Mac and … Read more

how to use gpu::Stream in OpenCV?

By default all gpu module functions are synchronous, i.e. current CPU thread is blocked until operation finishes. gpu::Stream is a wrapper for cudaStream_t and allows to use asynchronous non-blocking call. You can also read “CUDA C Programming Guide” for detailed information about CUDA asynchronous concurrent execution. Most gpu module functions have additional gpu::Stream parameter. If … Read more

How does CUDA assign device IDs to GPUs?

Set the environment variable CUDA_DEVICE_ORDER as: export CUDA_DEVICE_ORDER=PCI_BUS_ID Then the GPU IDs will be ordered by pci bus IDs.

Why is NVIDIA Pascal GPUs slow on running CUDA Kernels when using cudaMallocManaged

Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see note at the end about CUDA 9 updated behavior for windows.) With previous architectures (e.g. Maxwell), managed allocations used by a … Read more

CUDA apps time out & fail after several seconds – how to work around this?

I’m not a CUDA expert, — I’ve been developing with the AMD Stream SDK, which AFAIK is roughly comparable. You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious. To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1. You … Read more

CUDA: How many concurrent threads in total?

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads. Don’t confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored … Read more

Should I unify two similar kernels with an ‘if’ statement, risking performance loss?

You have a third alternative, which is to use C++ templating and make the variable which is used in the if/switch statement a template parameter. Instantiate each version of the kernel you need, and then you have multiple kernels doing different things with no branch divergence or conditional evaluation to worry about, because the compiler … Read more

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more

nvidia-smi Volatile GPU-Utilization explanation?

It is a sampled measurement over a time period. For a given time period, it reports what percentage of time one or more GPU kernel(s) was active (i.e. running). It doesn’t tell you anything about how many SMs were used, or how “busy” the code was, or what it was doing exactly, or in what … Read more

How are 2D / 3D CUDA blocks divided into warps?

Threads are numbered in order within blocks so that threadIdx.x varies the fastest, then threadIdx.y the second fastest varying, and threadIdx.z the slowest varying. This is functionally the same as column major ordering in multidimensional arrays. Warps are sequentially constructed from threads in this ordering. So the calculation for a 2d block is unsigned int … Read more