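Suppose I want to perform an asynchronous host-to-device memcpy in CUDA and then immediately launch a kernel. How can I test, from inside the kernel, whether the asynchronous transfer has completed?
Answer 0 (score: 8)
Sequencing the asynchronous copy and the kernel launch on the same CUDA stream ensures that the kernel executes only after the asynchronous transfer has completed. The following code example demonstrates this: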
#include <stdio.h>

__global__ void kernel(const int *ptr)
{
    printf("Hello, %d\n", *ptr);
}

int main()
{
    int *h_ptr = 0;

    // allocate pinned host memory with cudaMallocHost
    // pinned memory is required for asynchronous copy
    cudaMallocHost(&h_ptr, sizeof(int));

    // look for thirteen in the output
    *h_ptr = 13;

    // allocate device memory
    int *d_ptr = 0;
    cudaMalloc(&d_ptr, sizeof(int));

    // create a stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // sequence the asynchronous copy on our stream
    cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

    // sequence the kernel on our stream after the copy
    // the kernel will execute after the copy has completed
    kernel<<<1,1,0,stream>>>(d_ptr);

    // wait for the stream to drain so the device-side printf output is flushed
    cudaStreamSynchronize(stream);

    // clean up after ourselves
    cudaStreamDestroy(stream);
    cudaFree(d_ptr);
    cudaFreeHost(h_ptr);

    return 0;
}
Output:
$ nvcc -arch=sm_20 async.cu -run
Hello, 13
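I don't believe there is any supported way to test from within a kernel whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently of other threads of execution.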
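Although the kernel cannot perform this check itself, the host can ask whether the copy has finished. The following is only a minimal sketch of one way to do that, assuming an event is recorded on the stream right after the copy and then polled with cudaEventQuery. The names copyPollLaunch and doubleValue are hypothetical helpers made up for illustration; they are not part of the example above. Stream ordering alone already guarantees that the kernel runs after the copy, so the polling loop only matters if the host needs to know about completion.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Doubles the copied value so the host can verify the kernel saw the transferred data.
__global__ void doubleValue(const int *in, int *out)
{
    *out = 2 * (*in);
}

// Hypothetical helper: copies `value` to the device, launches the kernel on the
// same stream, and polls an event from the host to learn when the copy is done.
// Returns the kernel's result, or -1 if any CUDA call fails.
int copyPollLaunch(int value)
{
    int *h_buf = nullptr;            // pinned host buffer: [0] = input, [1] = output
    int *d_buf = nullptr;            // device buffer with the same layout
    cudaStream_t stream = nullptr;
    cudaEvent_t copyDone = nullptr;
    int result = -1;

    cudaError_t err = cudaMallocHost(&h_buf, 2 * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc(&d_buf, 2 * sizeof(int));
    if (err == cudaSuccess) err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaEventCreate(&copyDone);

    if (err == cudaSuccess) {
        h_buf[0] = value;
        err = cudaMemcpyAsync(d_buf, h_buf, sizeof(int),
                              cudaMemcpyHostToDevice, stream);
    }
    // The event completes once all work queued on the stream before it (the copy) is done.
    if (err == cudaSuccess) err = cudaEventRecord(copyDone, stream);

    if (err == cudaSuccess) {
        // Queued behind the copy on the same stream, so it runs after the copy.
        doubleValue<<<1, 1, 0, stream>>>(d_buf, d_buf + 1);
        err = cudaGetLastError();
    }

    if (err == cudaSuccess) {
        // Host-side test: cudaErrorNotReady means the copy is still pending.
        cudaError_t status;
        while ((status = cudaEventQuery(copyDone)) == cudaErrorNotReady) {
            // other host work could go here
        }
        err = status;
    }

    if (err == cudaSuccess)
        err = cudaMemcpyAsync(h_buf + 1, d_buf + 1, sizeof(int),
                              cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    if (err == cudaSuccess) result = h_buf[1];

    // clean up whatever was created
    if (copyDone) cudaEventDestroy(copyDone);
    if (stream) cudaStreamDestroy(stream);
    cudaFree(d_buf);
    if (h_buf) cudaFreeHost(h_buf);
    return result;
}

int main()
{
    printf("result = %d\n", copyPollLaunch(13));
    return 0;
}
```

If only a yes/no answer for the whole stream is needed, cudaStreamQuery(stream) can be polled the same way without creating an event.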