Copy an object to device?

Yes, you can copy an object to the device for use on the device. When the object has embedded pointers to dynamically allocated regions, the process requires some extra steps. See my answer here for a discussion of what is involved. That answer also has a few code samples linked to it. Also, in … Read more
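
For example, a minimal sketch of the extra "deep copy" steps, assuming a hypothetical struct MyObj with a single embedded pointer (the names and sizes here are illustrative, not taken from the linked answer):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical object with an embedded pointer to a dynamically allocated region
    struct MyObj {
        int    len;
        float *data;   // points to a separate allocation
    };

    __global__ void use_obj(const MyObj *obj) {
        printf("first element: %f\n", obj->data[0]);
    }

    int main() {
        const int len = 10;
        MyObj h_obj;
        h_obj.len  = len;
        h_obj.data = new float[len];
        for (int i = 0; i < len; i++) h_obj.data[i] = (float)i;

        // 1. allocate device space for the embedded region and copy its contents
        float *d_data;
        cudaMalloc(&d_data, len * sizeof(float));
        cudaMemcpy(d_data, h_obj.data, len * sizeof(float), cudaMemcpyHostToDevice);

        // 2. make a host-side copy of the object whose pointer refers to device memory
        MyObj tmp = h_obj;
        tmp.data  = d_data;

        // 3. allocate device space for the object itself and copy the fixed-up copy
        MyObj *d_obj;
        cudaMalloc(&d_obj, sizeof(MyObj));
        cudaMemcpy(d_obj, &tmp, sizeof(MyObj), cudaMemcpyHostToDevice);

        use_obj<<<1, 1>>>(d_obj);
        cudaDeviceSynchronize();

        cudaFree(d_data); cudaFree(d_obj);
        delete[] h_obj.data;
        return 0;
    }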

CUDA and nvcc: using the preprocessor to choose between float or double

It seems you might be conflating two things: how to differentiate between the host and device compilation trajectories when nvcc is processing CUDA code, and how to differentiate between CUDA and non-CUDA code. There is a subtle difference between the two. __CUDA_ARCH__ answers the first question, and __CUDACC__ answers the second. Consider the following … Read more
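
As a hedged sketch of how the two macros interact (the header name, USE_DOUBLE_PRECISION, real_t, and scale are illustrative, not from the original answer):

    // real_type.h -- illustrative header
    #ifdef USE_DOUBLE_PRECISION
    typedef double real_t;        // choose double when the macro is defined...
    #else
    typedef float  real_t;        // ...and float otherwise
    #endif

    #ifdef __CUDACC__
    // nvcc is processing this file as CUDA code (both the host and device passes land here)
    __host__ __device__ inline real_t scale(real_t x) {
    #ifdef __CUDA_ARCH__
        // device compilation trajectory: __CUDA_ARCH__ is defined (e.g. 700 for sm_70)
        return x * (real_t)2;
    #else
        // host compilation trajectory within nvcc: __CUDA_ARCH__ is not defined
        return x * (real_t)3;
    #endif
    }
    #else
    // an ordinary host compiler processing non-CUDA code lands here
    inline real_t scale(real_t x) { return x * (real_t)3; }
    #endif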

Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda?

There are a few things to note up-front: Refer to the half-precision intrinsics. Note that many of these intrinsics are only supported in device code. However, in recent/current CUDA versions, many/most of the conversion intrinsics are supported in both host and device code. (And @njuffa has created a set of host-usable conversion functions here.) Therefore, … Read more
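
A rough sketch of such sample code (a trivial half-precision vector add; the names are mine, and it assumes a reasonably recent CUDA version plus a GPU of compute capability 5.3 or higher for __hadd):

    #include <cstdio>
    #include <cuda_fp16.h>   // __half type, conversion and arithmetic intrinsics

    // __hadd is a device-only arithmetic intrinsic
    __global__ void vadd_half(const __half *a, const __half *b, __half *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = __hadd(a[i], b[i]);
    }

    int main() {
        const int n = 4;
        __half h_a[n], h_b[n], h_c[n];
        for (int i = 0; i < n; i++) {
            // the __float2half conversion is usable in host code in recent CUDA versions
            h_a[i] = __float2half(1.0f * i);
            h_b[i] = __float2half(2.0f * i);
        }

        __half *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, n * sizeof(__half));
        cudaMalloc(&d_b, n * sizeof(__half));
        cudaMalloc(&d_c, n * sizeof(__half));
        cudaMemcpy(d_a, h_a, n * sizeof(__half), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, n * sizeof(__half), cudaMemcpyHostToDevice);

        vadd_half<<<1, n>>>(d_a, d_b, d_c, n);
        cudaMemcpy(h_c, d_c, n * sizeof(__half), cudaMemcpyDeviceToHost);

        for (int i = 0; i < n; i++)
            printf("%f\n", __half2float(h_c[i]));  // convert back to float for printing

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Build it for an architecture of at least sm_53, e.g. nvcc -arch=sm_53, so the half arithmetic intrinsics are available.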

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more

How to get the CUDA version?

As Jared mentions in a comment, from the command line: nvcc --version (or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version). From application code, you can query the runtime API version with cudaRuntimeGetVersion() or the driver API version with cudaDriverGetVersion(). As Daniel points out, deviceQuery is an SDK sample app that … Read more
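
For example, a minimal host program that queries both versions might look like this (the printing format is mine; the API encodes versions as 1000*major + 10*minor):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int runtimeVer = 0, driverVer = 0;
        cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime version the app is built against
        cudaDriverGetVersion(&driverVer);    // latest CUDA version supported by the installed driver
        printf("Runtime: %d.%d  Driver: %d.%d\n",
               runtimeVer / 1000, (runtimeVer % 100) / 10,
               driverVer / 1000, (driverVer % 100) / 10);
        return 0;
    }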

How are 2D / 3D CUDA blocks divided into warps?

Threads are numbered in order within a block so that threadIdx.x varies fastest, threadIdx.y next, and threadIdx.z slowest. This is functionally the same as column-major ordering in multidimensional arrays. Warps are sequentially constructed from threads in this ordering. So the calculation for a 2D block is unsigned int … Read more
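
A short device-code sketch of that ordering, extended to the general 3D case (the kernel name and launch configuration are illustrative):

    #include <cstdio>

    __global__ void show_warp_ids() {
        // threadIdx.x varies fastest, threadIdx.y next, threadIdx.z slowest
        unsigned int tid = threadIdx.x
                         + threadIdx.y * blockDim.x
                         + threadIdx.z * blockDim.x * blockDim.y;
        unsigned int warp_id = tid / warpSize;   // warps are built from consecutive linear IDs
        unsigned int lane_id = tid % warpSize;
        if (lane_id == 0)
            printf("warp %u starts at linear thread %u = (%u,%u,%u)\n",
                   warp_id, tid, threadIdx.x, threadIdx.y, threadIdx.z);
    }

    int main() {
        show_warp_ids<<<1, dim3(8, 4, 2)>>>();   // a 64-thread 3D block = 2 warps
        cudaDeviceSynchronize();                 // flush device-side printf output
        return 0;
    }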