Why is CUDA pinned memory so fast?

CUDA Driver checks, if the memory range is locked or not and then it will use a different codepath. Locked memory is stored in the physical memory (RAM), so device can fetch it w/o help from CPU (DMA, aka Async copy; device only need list of physical pages). Not-locked memory can generate a page fault on access, and it is stored not only in memory (e.g. it can be in swap), so driver need to access every page of non-locked memory, copy it into pinned buffer and pass it to DMA (Syncronious, page-by-page copy).

As described here http://forums.nvidia.com/index.php?showtopic=164661

host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

I can also recommend to check cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. HostAlloc says that cuda driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

Leave a Comment