In CUDA 9, is cudaMemcpyAsync() both a device and a host function?

Time: 2018-02-02 22:05:35

Tags: c++ cuda gpu

According to the official CUDA doc, we have

__host__ __device__ cudaError_t cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )

This means it is both a host and a device function. However, in the actual installation on my local Linux box, I see this in /usr/local/cuda/include/cuda_runtime_api.h:

/** CUDA Runtime API Version */
#define CUDART_VERSION  9000
// Many lines away...
extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream __dv(0));

This seems to say that it is strictly a host function.

I tried to compile a simple kernel that calls cudaMemcpyAsync() and got the error

streaming.cu(338): error: calling a __host__ function("cudaMemcpyAsync") from a __global__ function("loopy_plus_one") is not allowed

This is another piece of evidence.

So I am really confused: is the documentation incorrect, or is my CUDA installation out of date?

Edit: update - if I change my compile command to explicitly specify sm_60, i.e. nvcc -arch=sm_60 -o out ./src.cu, the compilation error goes away, but a new one pops up:

ptxas fatal : Unresolved extern function 'cudaMemcpyAsync'

2 answers:

Answer 0 (score: 2)

There is a device implementation of cudaMemcpyAsync in the CUDA device runtime API, which you can see documented in the Programming Guide here. There, within the introductory section on Dynamic Parallelism it notes

Dynamic Parallelism is only supported by devices of compute capability 3.5 and higher
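If you are unsure what your GPU supports, a quick check with the standard runtime API looks like this (a minimal sketch; device 0 is assumed):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Query the compute capability of device 0.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}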

The documentation also notes the following about the device runtime API memory functions:

Notes about all memcpy/memset functions:

  • Only async memcpy/set functions are supported
  • Only device-to-device memcpy is permitted
  • May not pass in local or shared memory pointers
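For illustration, here is a minimal sketch of a kernel that respects these constraints; the kernel name, buffer size, and launch configuration are hypothetical, not taken from the question:

#include <cstdio>

__global__ void device_copy(int *dst, const int *src, size_t count)
{
    // A single thread issues the copy: asynchronous, device-to-device,
    // with both pointers referring to global memory, per the notes above.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        cudaMemcpyAsync(dst, src, count * sizeof(int),
                        cudaMemcpyDeviceToDevice, 0);
    }
}

int main()
{
    const size_t n = 16;
    int *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(int));
    cudaMalloc(&dst, n * sizeof(int));
    device_copy<<<1, 32>>>(dst, src, n);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(src);
    cudaFree(dst);
    return 0;
}

Built as described below, this compiles and runs; built without those options, it should reproduce the errors from the question.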

You can also find exact instructions for how you must compile and link code which uses the device runtime API:

CUDA programs are automatically linked with the host runtime library when compiled with nvcc, but the device runtime is shipped as a static library which must explicitly be linked with a program which wishes to use it.

The device runtime is offered as a static library (cudadevrt.lib on Windows, libcudadevrt.a under Linux and MacOS), against which a GPU application that uses the device runtime must be linked. Linking of device libraries can be accomplished through nvcc and/or nvlink.

So to make this work you must do exactly three things:

  1. Choose a physical target architecture which is at least compute capability 3.5 when you are compiling
  2. Use separate compilation for device code when you are compiling
  3. Link the CUDA device runtime library

It is for these three reasons (i.e. not doing any of them) that you have seen the compilation and linking errors when trying to use cudaMemcpyAsync inside kernel code.
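Putting the three requirements together, a compile line of the following shape should work; sm_60 (which satisfies the compute capability 3.5 minimum), the source name, and the output name are carried over from the question, and spelling out -lcudadevrt is harmless even where nvcc would link it automatically:

nvcc -arch=sm_60 -rdc=true -o out ./src.cu -lcudadevrt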

Answer 1 (score: -3)

It seems to work once I correctly specify the compute capability:

nvcc -arch=compute_60 -o out src.cu