CUDA kernel returning vectors

Something like this should work (coded in browser, not tested):

    // N is the maximum number of structs to insert
    #define N 10000

    typedef struct { int A, B, C; } Match;

    __device__ Match dev_data[N];
    __device__ int dev_count = 0;

    __device__ int my_push_back(Match * mt) {
      int insert_pt = atomicAdd(&dev_count, 1);
      if (insert_pt < N){

… Read more
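The excerpt above stops partway through the function, so what follows is a separate minimal sketch of the same idea, not the rest of the original answer: each thread reserves a slot with `atomicAdd` and writes its result only if the slot is within capacity. The names `append_match`, `find_matches`, and `run_find_matches` are hypothetical, and the "every even index is a match" rule is only a stand-in for a real test.

```cuda
#include <cuda_runtime.h>

#define N 10000  // capacity of the device-side result buffer

struct Match { int A, B, C; };

__device__ Match dev_data[N];
__device__ int dev_count = 0;

// Reserve one slot with atomicAdd, then write into it if it is in range.
// Returns the slot index, or -1 when the buffer is already full.
// (Hypothetical helper, written for this sketch.)
__device__ int append_match(const Match &m) {
    int slot = atomicAdd(&dev_count, 1);
    if (slot < N) {
        dev_data[slot] = m;
        return slot;
    }
    return -1;
}

// Hypothetical kernel: every thread with an even global index reports a match.
__global__ void find_matches(int total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total && (i % 2) == 0) {
        Match m = {i, 2 * i, 3 * i};
        append_match(m);
    }
}

// Host helper: reset the counter, launch the kernel, and return how many
// entries of dev_data are valid afterwards.
int run_find_matches(int total) {
    int zero = 0;
    cudaMemcpyToSymbol(dev_count, &zero, sizeof(int));

    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    find_matches<<<blocks, threads>>>(total);
    cudaDeviceSynchronize();

    int count = 0;
    cudaMemcpyFromSymbol(&count, dev_count, sizeof(int));
    return count < N ? count : N;  // dev_count can overshoot N once the buffer is full
}
```

The order of entries in `dev_data` is not deterministic, since it depends on which thread reaches `atomicAdd` first. The results themselves can be copied back with `cudaMemcpyFromSymbol` on `dev_data`, using the returned count as the length.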

How do I use NVIDIA Multi-Process Service (MPS) to run multiple non-MPI CUDA applications?

The necessary instructions are contained in the documentation for the MPS service. You’ll note that those instructions don’t really depend on or call out MPI, so there really isn’t anything MPI-specific about them. Here’s a walkthrough/example. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or … Read more

128-bit integer on CUDA?

For best performance, one would want to map the 128-bit type on top of a suitable CUDA vector type, such as uint4, and implement the functionality using PTX inline assembly. The addition would look something like this:

    typedef uint4 my_uint128_t;

    __device__ my_uint128_t add_uint128 (my_uint128_t addend, my_uint128_t augend)
    {
        my_uint128_t res;
        asm ("add.cc.u32 %0, %4, %8;\n\t"

… Read more
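For comparison, here is a rough sketch of the same addition without inline PTX. It is a swapped-in technique, not the approach the excerpt recommends: the carry is propagated through 64-bit intermediates, which is easier to read and also runs on the host, but is unlikely to match the hand-written carry chain for speed. The function name `add_uint128_portable` is hypothetical, and the limb order (`x` least significant, `w` most significant) is an assumption.

```cuda
#include <cuda_runtime.h>  // uint4, make_uint4

// Assumed layout: x holds the least significant 32 bits, w the most significant.
typedef uint4 my_uint128_t;

// Portable 128-bit add: sum each pair of 32-bit limbs in a 64-bit
// temporary and feed bit 32 of the result into the next limb as the carry.
__host__ __device__ inline my_uint128_t add_uint128_portable(my_uint128_t a,
                                                             my_uint128_t b) {
    my_uint128_t r;
    unsigned long long t;

    t = (unsigned long long)a.x + b.x;
    r.x = (unsigned int)t;

    t = (unsigned long long)a.y + b.y + (t >> 32);
    r.y = (unsigned int)t;

    t = (unsigned long long)a.z + b.z + (t >> 32);
    r.z = (unsigned int)t;

    t = (unsigned long long)a.w + b.w + (t >> 32);
    r.w = (unsigned int)t;  // the carry out of the top limb is discarded (wraps mod 2^128)

    return r;
}
```

Because it is marked `__host__ __device__`, a version like this can be handy as a reference when checking a PTX implementation against known values on the CPU.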

How do CUDA blocks/warps/threads map onto CUDA cores?

Two of the best references are the NVIDIA Fermi Compute Architecture Whitepaper and the GF104 reviews. I’ll try to answer each of your questions. The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a … Read more
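To make the grid / block / thread hierarchy a bit more concrete, here is a small sketch in which every thread records which block it belongs to and where it sits within that block's warps. The names `ThreadInfo`, `record_thread_info`, and `launch_and_collect` are hypothetical. It only shows the software-side indexing; which SM a block actually lands on is decided by the hardware and is not visible here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Per-thread record of its position in the launch hierarchy (hypothetical type).
struct ThreadInfo {
    int block;  // index of the thread block within the grid
    int warp;   // index of the warp within the block
    int lane;   // index of the thread within its warp
};

__global__ void record_thread_info(ThreadInfo *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    out[tid].block = blockIdx.x;
    out[tid].warp  = threadIdx.x / warpSize;  // consecutive threads are grouped into warps
    out[tid].lane  = threadIdx.x % warpSize;
}

// Host helper: launch a 1D grid and copy every thread's record back.
std::vector<ThreadInfo> launch_and_collect(int blocks, int threads_per_block) {
    int total = blocks * threads_per_block;
    ThreadInfo *d_out = nullptr;
    cudaMalloc(&d_out, total * sizeof(ThreadInfo));

    record_thread_info<<<blocks, threads_per_block>>>(d_out);

    std::vector<ThreadInfo> h_out(total);
    // cudaMemcpy on the default stream waits for the kernel to finish first.
    cudaMemcpy(h_out.data(), d_out, total * sizeof(ThreadInfo),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With 2 blocks of 64 threads, for example, global thread 100 is thread 36 of block 1; assuming the usual warp size of 32, that places it in warp 1 of its block at lane 4.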