How can I check the progress of matrix multiplication?

Here is code that demonstrates how to check progress from a matrix-multiply kernel:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TIME_INC 100000000
#define INCS 10
#define USE_PROGRESS 1
#define MAT_DIMX 4000
#define MAT_DIMY MAT_DIMX

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal
```

… Read more
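The excerpt above is cut off before the kernel itself. The underlying idea is to have the kernel publish a counter that the host can poll while the kernel is still running. A minimal sketch of that technique, assuming zero-copy (mapped) host memory; the kernel name, grid shape, and variable names here are illustrative, not from the original listing:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each block that finishes its share of the work atomically bumps a counter
// living in page-locked, mapped host memory, so the host can watch it live.
__global__ void work_kernel(volatile unsigned int *progress)
{
    // ... this block's share of the matrix multiply would go here ...
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd((unsigned int *)progress, 1u);
}

int main(void)
{
    unsigned int *h_progress, *d_progress;
    cudaSetDeviceFlags(cudaDeviceMapHost);                 // enable zero-copy
    cudaHostAlloc(&h_progress, sizeof(*h_progress), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_progress, h_progress, 0);
    *h_progress = 0;

    dim3 grid(250, 250), block(16, 16);
    unsigned int total = grid.x * grid.y;
    work_kernel<<<grid, block>>>(d_progress);              // async launch

    // Poll from the host while the kernel runs; the volatile read forces a
    // fresh load each iteration.
    while (*((volatile unsigned int *)h_progress) < total)
        fprintf(stderr, "\rprogress: %u/%u blocks", *h_progress, total);
    cudaDeviceSynchronize();
    fprintf(stderr, "\ndone\n");
    return 0;
}
```

The mapped allocation is what makes this work: the device writes directly into host-visible memory, so no `cudaMemcpy` is needed mid-kernel.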

How to use 2D Arrays in CUDA?

How to allocate a 2D array:

```cuda
int main() {
#define BLOCK_SIZE 16
#define GRID_SIZE 1
    int d_A[BLOCK_SIZE][BLOCK_SIZE];
    int d_B[BLOCK_SIZE][BLOCK_SIZE];

    /* d_A initialization */

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); // so your threads are BLOCK_SIZE*BLOCK_SIZE, 256 in this case
    dim3 dimGrid(GRID_SIZE, GRID_SIZE);    // 1*1 blocks in a grid
    YourKernel<<<dimGrid, dimBlock>>>(d_A, d_B); // kernel invocation
}
```

How to traverse that array:

```cuda
__global__ void
```

… Read more
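For data that actually lives in device memory, the common alternative is a pitched allocation, which pads each row for aligned access. A sketch of that pattern, with the kernel name and sizes chosen for illustration:

```cuda
#include <cuda_runtime.h>

// The pitch returned by cudaMallocPitch is in BYTES, so rows are addressed
// by casting the base pointer to char* and stepping y * pitch bytes.
__global__ void MyKernel(float *devPtr, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float *row = (float *)((char *)devPtr + y * pitch);
        row[x] *= 2.0f;                       // element (y, x)
    }
}

int main(void)
{
    int width = 64, height = 64;
    float *devPtr;
    size_t pitch;
    cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    MyKernel<<<grid, block>>>(devPtr, pitch, width, height);
    cudaDeviceSynchronize();
    cudaFree(devPtr);
    return 0;
}
```

Unlike the stack arrays in the excerpt above, this memory is device-resident, so it must be filled with `cudaMemcpy2D` before the kernel can read meaningful data from it.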

multi-GPU basic usage

Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have needed to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use multiple GPUs inside the same host application. Now … Read more
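Since CUDA 4.0, a single host thread can drive several GPUs simply by switching the current device with `cudaSetDevice`. A minimal sketch of the pattern, with a placeholder kernel and an assumed upper bound on the device count:

```cuda
#include <cuda_runtime.h>

#define MAX_GPUS 16  /* assumed cap for this sketch */

__global__ void dummy_kernel(float *out) { out[threadIdx.x] = (float)threadIdx.x; }

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > MAX_GPUS) ngpus = MAX_GPUS;

    float *d_buf[MAX_GPUS];

    // Queue one allocation and one kernel launch per device; launches are
    // asynchronous, so every GPU is working before any of them finishes.
    for (int dev = 0; dev < ngpus; ++dev) {
        cudaSetDevice(dev);             // subsequent calls target this GPU
        cudaMalloc(&d_buf[dev], 256 * sizeof(float));
        dummy_kernel<<<1, 256>>>(d_buf[dev]);
    }

    // Wait for every device, then clean up.
    for (int dev = 0; dev < ngpus; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_buf[dev]);
    }
    return 0;
}
```

The key point is that device selection is per host thread and sticky: everything after a `cudaSetDevice(dev)` call targets that device until the next call.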

How do I select which GPU to run a job on?

The problem was caused by not setting the `CUDA_VISIBLE_DEVICES` variable correctly within the shell. To specify CUDA device 1, for example, you would set `CUDA_VISIBLE_DEVICES` using `export CUDA_VISIBLE_DEVICES=1` or `CUDA_VISIBLE_DEVICES=1 ./cuda_executable`. The former sets the variable for the life of the current shell, the latter only for the lifespan of that particular executable invocation. … Read more
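The scoping difference between the two forms is ordinary shell behavior and easy to demonstrate, using `sh -c` as a stand-in for the CUDA executable:

```shell
# Per-invocation form: the variable exists only for that one command.
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "child sees: $CUDA_VISIBLE_DEVICES"'
echo "shell itself sees: ${CUDA_VISIBLE_DEVICES:-unset}"

# Exported form: every subsequent command run from this shell inherits it.
export CUDA_VISIBLE_DEVICES=1
sh -c 'echo "child sees: $CUDA_VISIBLE_DEVICES"'
```

The CUDA runtime reads this variable at initialization, so devices excluded from it are invisible to the process, and the remaining devices are renumbered from 0.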

CUDA apps time out & fail after several seconds – how to work around this?

I'm not a CUDA expert — I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable. You can disable the Windows watchdog timer, but that is highly inadvisable, for reasons that should be obvious. To disable it, open regedit, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, create a REG_DWORD value named DisableBugCheck, and set it to 1. You … Read more
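The same edit can be captured in a `.reg` file so it can be reviewed before merging. This is only a sketch of the setting the answer describes; a reboot is typically needed for watchdog changes to take effect, and disabling the watchdog means a hung kernel can freeze the display:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"DisableBugCheck"=dword:00000001
```

A safer workaround on systems with more than one GPU is to run CUDA work on a card that is not driving a display, since the watchdog only applies to display devices.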

CUDA: How many concurrent threads in total?

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads. Don’t confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored … Read more

CUDA: How to use -arch and -code and SM vs COMPUTE

Some related questions/answers are here and here. I am still not sure how to properly specify the architectures for code generation when building with nvcc. A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you … Read more
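As a concrete illustration of that canonical usage (the sm_70/compute_70 values below are example architectures, not taken from the excerpt):

```shell
# Shorthand: -arch=sm_70 targets both the virtual architecture (compute_70)
# and the real one (sm_70) in a single flag.
nvcc -arch=sm_70 kernel.cu -o app

# Explicit equivalent with -gencode: SASS for sm_70 plus embedded compute_70
# PTX, which lets newer GPUs JIT-compile the kernel at load time.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 kernel.cu -o app
```

The rule of thumb is that `arch=` (COMPUTE, virtual) bounds the features the compiler may use, while `code=` (SM, real) selects which binaries and/or PTX end up embedded in the fat binary.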

What kind of variables consume registers in CUDA?

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment form, meaning that each register in PTX is assigned only once. A piece of PTX with hundreds of registers can compile into a … Read more
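Because the PTX register count is meaningless, the way to see the real per-thread register usage is to ask the machine-code assembler (ptxas) for its resource summary at compile time:

```shell
# -Xptxas -v makes ptxas print per-kernel resource usage: registers,
# shared/constant memory, and any register spills to local memory.
nvcc -Xptxas -v kernel.cu -c -o kernel.o
```

The resulting "Used N registers" line reflects the allocation after ptxas has mapped the SSA-form PTX onto the finite hardware register file, which is the number that actually limits occupancy.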