Copy an object to device?

Yes, you can copy an object to the device for use on the device. When the object has embedded pointers to dynamically allocated regions, the process requires some extra steps. See my answer here for a discussion of what is involved. That answer also has a few code samples linked to it. Also, in … Read more
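
For example, a minimal sketch of the extra "deep copy" steps, assuming a hypothetical struct MyObj with a single embedded pointer (the names and sizes here are illustrative, not taken from the linked answer):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical object with an embedded pointer to a dynamically allocated region
    struct MyObj {
        int    len;
        float *data;   // points to a separate allocation
    };

    __global__ void use_obj(const MyObj *obj) {
        printf("first element: %f\n", obj->data[0]);
    }

    int main() {
        const int len = 10;
        MyObj h_obj;
        h_obj.len  = len;
        h_obj.data = new float[len];
        for (int i = 0; i < len; i++) h_obj.data[i] = (float)i;

        // 1. allocate device space for the embedded region and copy its contents
        float *d_data;
        cudaMalloc(&d_data, len * sizeof(float));
        cudaMemcpy(d_data, h_obj.data, len * sizeof(float), cudaMemcpyHostToDevice);

        // 2. make a host-side copy of the object whose pointer refers to device memory
        MyObj tmp = h_obj;
        tmp.data  = d_data;

        // 3. allocate device space for the object itself and copy the fixed-up copy
        MyObj *d_obj;
        cudaMalloc(&d_obj, sizeof(MyObj));
        cudaMemcpy(d_obj, &tmp, sizeof(MyObj), cudaMemcpyHostToDevice);

        use_obj<<<1, 1>>>(d_obj);
        cudaDeviceSynchronize();

        cudaFree(d_data); cudaFree(d_obj);
        delete[] h_obj.data;
        return 0;
    }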

CUDA and nvcc: using the preprocessor to choose between float or double

It seems you might be conflating two things: how to differentiate between the host and device compilation trajectories when nvcc is processing CUDA code, and how to differentiate between CUDA and non-CUDA code. There is a subtle difference between the two. __CUDA_ARCH__ answers the first question, and __CUDACC__ answers the second. Consider the following … Read more
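
As a hedged sketch of how the two macros interact (the header name, USE_DOUBLE_PRECISION, real_t, and scale are illustrative, not from the original answer):

    // real_type.h -- illustrative header
    #ifdef USE_DOUBLE_PRECISION
    typedef double real_t;        // choose double when the macro is defined...
    #else
    typedef float  real_t;        // ...and float otherwise
    #endif

    #ifdef __CUDACC__
    // nvcc is processing this file as CUDA code (both the host and device passes land here)
    __host__ __device__ inline real_t scale(real_t x) {
    #ifdef __CUDA_ARCH__
        // device compilation trajectory: __CUDA_ARCH__ is defined (e.g. 700 for sm_70)
        return x * (real_t)2;
    #else
        // host compilation trajectory within nvcc: __CUDA_ARCH__ is not defined
        return x * (real_t)3;
    #endif
    }
    #else
    // an ordinary host compiler processing non-CUDA code lands here
    inline real_t scale(real_t x) { return x * (real_t)3; }
    #endif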

Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda?

There are a few things to note up-front: Refer to the half-precision intrinsics. Note that many of these intrinsics are only supported in device code. However, in recent/current CUDA versions, many/most of the conversion intrinsics are supported in both host and device code. (And @njuffa has created a set of host-usable conversion functions here.) Therefore, … Read more
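
A rough sketch of such sample code (a trivial half-precision vector add; the names are mine, and it assumes a reasonably recent CUDA version plus a GPU of compute capability 5.3 or higher for __hadd):

    #include <cstdio>
    #include <cuda_fp16.h>   // __half type, conversion and arithmetic intrinsics

    // __hadd is a device-only arithmetic intrinsic
    __global__ void vadd_half(const __half *a, const __half *b, __half *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = __hadd(a[i], b[i]);
    }

    int main() {
        const int n = 4;
        __half h_a[n], h_b[n], h_c[n];
        for (int i = 0; i < n; i++) {
            // the __float2half conversion is usable in host code in recent CUDA versions
            h_a[i] = __float2half(1.0f * i);
            h_b[i] = __float2half(2.0f * i);
        }

        __half *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, n * sizeof(__half));
        cudaMalloc(&d_b, n * sizeof(__half));
        cudaMalloc(&d_c, n * sizeof(__half));
        cudaMemcpy(d_a, h_a, n * sizeof(__half), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, n * sizeof(__half), cudaMemcpyHostToDevice);

        vadd_half<<<1, n>>>(d_a, d_b, d_c, n);
        cudaMemcpy(h_c, d_c, n * sizeof(__half), cudaMemcpyDeviceToHost);

        for (int i = 0; i < n; i++)
            printf("%f\n", __half2float(h_c[i]));  // convert back to float for printing

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Build it for an architecture of at least sm_53, e.g. nvcc -arch=sm_53, so the half arithmetic intrinsics are available.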

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more

How to get the CUDA version?

As Jared mentions in a comment, from the command line: nvcc --version (or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version). From application code, you can query the runtime API version with cudaRuntimeGetVersion() or the driver API version with cudaDriverGetVersion(). As Daniel points out, deviceQuery is an SDK sample app that … Read more
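
For example, a minimal host program that queries both versions might look like this (the printing format is mine; the API encodes versions as 1000*major + 10*minor):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int runtimeVer = 0, driverVer = 0;
        cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime version the app is built against
        cudaDriverGetVersion(&driverVer);    // latest CUDA version supported by the installed driver
        printf("Runtime: %d.%d  Driver: %d.%d\n",
               runtimeVer / 1000, (runtimeVer % 100) / 10,
               driverVer / 1000, (driverVer % 100) / 10);
        return 0;
    }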

How are 2D / 3D CUDA blocks divided into warps?

Threads are numbered in order within a block so that threadIdx.x varies fastest, threadIdx.y next, and threadIdx.z slowest. This is functionally the same as column-major ordering in multidimensional arrays. Warps are sequentially constructed from threads in this ordering. So the calculation for a 2D block is unsigned int … Read more
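
A short device-code sketch of that ordering, extended to the general 3D case (the kernel name and launch configuration are illustrative):

    #include <cstdio>

    __global__ void show_warp_ids() {
        // threadIdx.x varies fastest, threadIdx.y next, threadIdx.z slowest
        unsigned int tid = threadIdx.x
                         + threadIdx.y * blockDim.x
                         + threadIdx.z * blockDim.x * blockDim.y;
        unsigned int warp_id = tid / warpSize;   // warps are built from consecutive linear IDs
        unsigned int lane_id = tid % warpSize;
        if (lane_id == 0)
            printf("warp %u starts at linear thread %u = (%u,%u,%u)\n",
                   warp_id, tid, threadIdx.x, threadIdx.y, threadIdx.z);
    }

    int main() {
        show_warp_ids<<<1, dim3(8, 4, 2)>>>();   // a 64-thread 3D block = 2 warps
        cudaDeviceSynchronize();                 // flush device-side printf output
        return 0;
    }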