cuda - w3toppers.com

“invalid configuration argument” error for the call of CUDA kernel?

This type of error message usually refers to the kernel launch configuration parameters (the grid and thread-block dimensions in this case; in other cases it could be the shared memory size, etc.). When you see a message like this, it’s a good idea to print out your actual configuration parameters before launching the kernel, to see if you’ve made any …
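The advice to print the configuration before launching can be sketched in plain host-side C++. This is only an illustration, not the original answer's code: `Dim3`, `LaunchLimits`, and `launchConfigLooksValid` are hypothetical names, and the limits are assumed values for compute capability 3.0 and later; a real program would read them from `cudaGetDeviceProperties()` and would also check the requested shared memory size.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds without the toolkit.
struct Dim3 {
    unsigned x, y, z;
};

// Assumed limits for compute capability 3.0 and later. In real code, read
// them from cudaGetDeviceProperties() rather than hard-coding them.
struct LaunchLimits {
    unsigned maxThreadsPerBlock = 1024;
    Dim3 maxBlockDim = {1024u, 1024u, 64u};
    Dim3 maxGridDim = {2147483647u, 65535u, 65535u};
};

// True when every component of d lies in [1, limit].
static bool withinLimits(const Dim3& d, const Dim3& limit) {
    return d.x >= 1 && d.x <= limit.x &&
           d.y >= 1 && d.y <= limit.y &&
           d.z >= 1 && d.z <= limit.z;
}

// Prints the configuration, then reports whether it respects the limits.
bool launchConfigLooksValid(const Dim3& grid, const Dim3& block,
                            const LaunchLimits& limits = LaunchLimits()) {
    assert(limits.maxThreadsPerBlock > 0);  // limits must be populated
    std::printf("grid=(%u,%u,%u) block=(%u,%u,%u)\n",
                grid.x, grid.y, grid.z, block.x, block.y, block.z);
    // 64-bit product so large dimensions cannot overflow before the check.
    const unsigned long long threads = 1ULL * block.x * block.y * block.z;
    return withinLimits(grid, limits.maxGridDim) &&
           withinLimits(block, limits.maxBlockDim) &&
           threads <= limits.maxThreadsPerBlock;
}
```

Calling `launchConfigLooksValid(grid, block)` just before the `<<<grid, block>>>` launch makes a zero-sized grid or an oversized block visible immediately, instead of surfacing later as an error code.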

Best approach for GPGPU/CUDA/OpenCL in Java?

AFAIK, JavaCL / OpenCL4Java is the only OpenCL binding that is available on all platforms right now (including MacOS X, FreeBSD, Linux, Windows, Solaris, all in Intel 32, 64 bits and ppc variants, thanks to its use of JNA). It has demos that actually run fine from Java Web Start at least on Mac and …

Using an array of device function pointers

Function pointers are allowed on Fermi. This is how you could do it:
    typedef double (*func)(double x);
    __device__ double func1(double x) { return x+1.0f; }
    __device__ double func2(double x) { return x+2.0f; }
    __device__ double func3(double x) { return x+3.0f; }
    __device__ func pfunc1 = func1;
    __device__ func pfunc2 = func2;
    __device__ func pfunc3 = …

Which Compute Capability is supported by which CUDA versions?

CUDA Version      Min CC   Deprecated CC   Default CC   Max CC
5.5 (and prior)   1.0      N/A             1.0          ?
6.0               1.0      1.0             1.0          ?
6.5               1.1      1.x             2.0          ?
7.x               2.0      N/A             2.0          ?
8.0               2.0      2.x             2.0          6.2
9.x               3.0      N/A             3.0          7.0
10.x              3.0 *    N/A             3.0          7.5
11.x              3.5 †    3.x             5.2          11.0:8.0, 11.1:8.6, …

CUDA : How to allocate memory for data member of a class

Perhaps you should include a complete, simple example. (If I compile your code above and run it by itself on Linux, I get a seg fault at the second cudaMalloc operation.) One wrinkle I see is that, since you allocated the particle objects in device memory in the first step, when you go to …

Multiply Rectangular Matrices in CUDA

With the help of Ira, Ahmad, ram, and Oli Fly, I got the correct answer as follows:
    #include <wb.h>
    #define wbCheck(stmt) do { \
        cudaError_t err = stmt; \
        if (err != cudaSuccess) { \
            wbLog(ERROR, "Failed to run stmt ", #stmt); \
            return -1; \
        } \
    } while(0)
    // Compute C = A …

Compiling code containing dynamic parallelism fails

You can do something like this:
    nvcc -arch=sm_35 -rdc=true simple1.cu -o simple1 -lcudadevrt
Or, if you have two files, simple1.cu and test.c, you can do something like the following. This is called separate compilation.
    nvcc -arch=sm_35 -dc simple1.cu
    nvcc -arch=sm_35 -dlink simple1.o -o link.o -lcudadevrt
    g++ -c test.c
    g++ link.o simple1.o test.o -o simple -L/usr/local/cuda/lib64/ …

CUDA function pointers

To get rid of your compile error, you’ll have to use -gencode arch=compute_20,code=sm_20 as a compiler argument when compiling your code. But then you’ll likely have some runtime problems. Taken from the CUDA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#functions): Function pointers to __global__ functions are supported in host code, but not in device code. Function pointers to __device__ …

printf() in my CUDA kernel doesn’t produce any output

printf() output is only displayed if the kernel finishes successfully, so check the return codes of all CUDA function calls and make sure no errors are reported. Furthermore, printf() output is only displayed at certain points in the program. Appendix B.32.2 of the Programming Guide lists these as: Kernel launch via <<<>>> or cuLaunchKernel() (at …

CUDA allocation alignment is 256 bytes – seriously?

Pointers allocated by any of the CUDA Runtime’s device memory allocation functions, e.g. cudaMalloc or cudaMallocPitch, are guaranteed to be 256-byte aligned, i.e. the address is a multiple of 256. Consider the following example:
    char *ptr1, *ptr2;
    int bytes = 1;
    cudaMalloc((void**)&ptr1, bytes);
    cudaMalloc((void**)&ptr2, bytes);
Suppose the address returned in ptr1 is …
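The "multiple of 256" property described above can be checked with a little host-side arithmetic. The following is a minimal C++ sketch, not code from the original answer: `kAssumedAlignment`, `isAligned`, and `roundUpToAlignment` are hypothetical helper names, and the 256-byte figure is simply the value quoted in the excerpt. How far apart two consecutive allocations actually land is up to the allocator, so the rounding helper only shows the smallest aligned size that could hold a request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Alignment quoted in the excerpt; treated here as an assumption.
constexpr std::size_t kAssumedAlignment = 256;

// True when alignment is a non-zero power of two.
static bool isPowerOfTwo(std::size_t alignment) {
    return alignment != 0 && (alignment & (alignment - 1)) == 0;
}

// True when the address held in p is a multiple of alignment.
bool isAligned(const void* p, std::size_t alignment = kAssumedAlignment) {
    assert(isPowerOfTwo(alignment));
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Rounds bytes up to the next multiple of alignment (0 stays 0).
std::size_t roundUpToAlignment(std::size_t bytes,
                               std::size_t alignment = kAssumedAlignment) {
    assert(isPowerOfTwo(alignment));
    return (bytes + alignment - 1) & ~(alignment - 1);
}
```

In a CUDA program, passing the pointers returned by cudaMalloc to `isAligned` would confirm the guarantee, and `roundUpToAlignment(1)` shows why a 1-byte request can still account for a full 256-byte step between addresses.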