Where is pinned memory allocated using cudaHostAlloc?

Time: 2018-03-25 19:59:03

Tags: cuda

I am reading about Page-Locked Host Memory in the CUDA Programming Guide and want to know where this pinned memory is allocated when it is created with the function cudaHostAlloc. Is it in the kernel address space, or is it allocated in the process address space?

1 Answer:

Answer 0 (score: 4)

"Page-Locked Host Memory" for CUDA (and other DMA-capable external hardware like PCI-express cards) is allocated in physical memory of the Host computer. The allocation is marked as not-swappable (not-pageable) and not-movable (locked, pinned). This is similar to the action of mlock syscall "lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area."

This allocation can be accessed through the kernel virtual address space (as the kernel has a full view of physical memory), and it is also mapped into the user process's virtual address space so the process can access it.

When you do an ordinary malloc, the actual physical memory allocation may (and usually will) be postponed until the first (write) access to the pages. With mlocked/pinned memory, all physical pages are allocated inside the locking or pinning call (like MAP_POPULATE in mmap: "Populate (prefault) page tables for a mapping"), and the physical addresses of the pages will not change (no swapping, no moving, no compacting...).
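To make the distinction concrete, here is a minimal sketch; the buffer size and the memset used to fault pages in are illustrative assumptions, not from the answer:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t size = 1 << 20;  // 1 MiB, illustrative

    // Ordinary malloc: virtual pages are reserved, but physical pages
    // are typically faulted in only on first write access.
    char *pageable = (char *)malloc(size);
    memset(pageable, 0, size);  // first write faults the pages in

    // cudaHostAlloc: physical pages are allocated and pinned inside the
    // call itself; the returned pointer is an ordinary pointer in the
    // process's virtual address space.
    char *pinned = nullptr;
    cudaError_t err = cudaHostAlloc((void **)&pinned, size, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    pinned[0] = 42;  // host code dereferences it like any other pointer

    free(pageable);
    cudaFreeHost(pinned);  // pinned memory must be freed with cudaFreeHost
    return 0;
}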

CUDA docs: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902

__host__ cudaError_t cudaHostAlloc ( void** pHost, size_t size, unsigned int flags )

Allocates page-locked memory on the host. ...

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

...

Memory allocated by this function must be freed with cudaFreeHost().
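A short usage sketch following the documented contract above; the staging-buffer pattern mirrors the docs' recommendation, while the names and sizes are assumptions:

#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 22;  // illustrative element count
    float *h_staging = nullptr, *d_buf = nullptr;

    // (error checking omitted for brevity)
    cudaHostAlloc((void **)&h_staging, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) h_staging[i] = 1.0f;

    // The driver tracks h_staging as pinned and accelerates the copy.
    cudaMemcpy(d_buf, h_staging, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_staging);  // not free()!
    return 0;
}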

Pinned and non-pinned memory compared: https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html "Choosing Between Pinned and Non-Pinned Memory"

Pinned memory is memory allocated using the cudaMallocHost function, which prevents the memory from being swapped out and provides improved transfer speeds. Non-pinned memory is memory allocated using the malloc function. As described in Memory Management Overhead and Memory Transfer Overhead, pinned memory is much more expensive to allocate and deallocate but provides higher transfer throughput for large memory transfers.
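A rough way to observe this tradeoff yourself is to time the same host-to-device copy from a pageable and a pinned buffer. The sketch below uses CUDA events for timing; the buffer size is an arbitrary assumption:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static float time_h2d(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 256 << 20;  // 256 MiB, illustrative
    void *d = nullptr, *pinned = nullptr;
    void *pageable = malloc(bytes);
    cudaMalloc(&d, bytes);
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);

    printf("pageable H2D: %.2f ms\n", time_h2d(d, pageable, bytes));
    printf("pinned   H2D: %.2f ms\n", time_h2d(d, pinned, bytes));

    cudaFree(d);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}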

CUDA forums post with advice from moderator txbob: https://devtalk.nvidia.com/default/topic/899020/does-cudamemcpyasync-require-pinned-memory-/ "Does cudaMemcpyAsync require pinned memory?"

If you want truly asynchronous behavior (e.g. overlap of copy and compute) then the memory must be pinned. If it is not pinned, there won't be any runtime errors, but the copy will not be asynchronous - it will be performed like an ordinary cudaMemcpy.

The usable size may vary by system and OS. Pinning 4GB of memory on a 64GB system on Linux should not have a significant effect on CPU performance, after the pinning operation is complete. Attempting to pin 60GB on the other hand might cause significant system responsiveness issues.
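To illustrate the point about asynchronous behavior, here is a sketch of the pinned-memory-plus-stream pattern; the kernel, sizes, and launch configuration are placeholders, not from the forum thread:

#include <cuda_runtime.h>

__global__ void busywork(float *d, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const size_t n = 1 << 24;
    float *h = nullptr, *d = nullptr;
    cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Because h is pinned, these copies return to the host immediately
    // and can overlap with host work or with work in other streams.
    // With a pageable buffer they would fall back to synchronous copies.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    busywork<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for copy + kernel + copy

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}

Note that within a single stream the operations still execute in order; true copy/compute overlap additionally requires splitting the work into chunks across multiple streams.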