Question

I am running Windows 7 64-bit, CUDA 4.2, and Visual Studio 2010.

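First, I run some code on the GPU and download the data back to the host. I then do some processing on the host and move back to the device. After that I perform the following device-to-host copy, which runs very fast, around 1 ms.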

clock_t start, end;
int count = 1000000;
thrust::host_vector<int> h_a(count);
thrust::device_vector<int> d_b(count, 0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); // device -> host copy
end = clock();
cout << "Time Spent:" << end - start << endl;

It takes about 1 ms to complete.

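Then I run some other code on the GPU, mainly atomic operations. When I copy the data from the device to the host afterwards, it takes a very long time, around 9 s.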

__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

It takes about 9 s to complete.

I ran the code several times, for example:

// A kernel cannot be defined inside a loop, so it is declared first.
__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}

int i = 0;
while (i < 10)
{
clock_t start, end;
int count = 1000000;
thrust::host_vector<int> h_a(count);
thrust::device_vector<int> d_b(count, 0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); // first copy: ~1 ms
end = clock();
cout << "Time Spent:" << end - start << endl;

// dosomething is launched on d_bPtr here (launch configuration not shown in the post)

start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin()); // second copy: ~9 s
end = clock();
cout << "Time Spent:" << end - start << endl;
i++;
}

The results are almost the same every time.
What could be the problem?

Thanks!

Answer 1

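The problem is one of timing, not any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy, but also the time for the previously launched kernel to finish. If you change your timing code for the copy operation to something like this: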

cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

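You should find that the transfer time returns to its previous value. So your real question isn't "why is thrust::copy slow", it is "why is my kernel slow". Based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialize the kernel's memory transactions".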
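To see the two costs separately, the kernel and the copy can each be bracketed with CUDA events. The following is only a minimal sketch under stated assumptions: the kernel `atomic_heavy`, its launch configuration, and the helper `elapsed_ms` are hypothetical stand-ins, since the real kernel body and launch parameters are not shown in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical stand-in for the elided kernel: every thread performs an
// atomic exchange on the same address, so the updates are serialized.
__global__ void atomic_heavy(int *d_bPtr, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        atomicExch(d_bPtr, tid);
}

// Hypothetical helper: milliseconds elapsed between two recorded events.
static float elapsed_ms(cudaEvent_t begin, cudaEvent_t finish)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, begin, finish);
    return ms;
}

int main()
{
    const int count = 1000000;
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(d_b.data());

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    // Time the kernel alone. The launch returns immediately, so the host
    // must wait on the event recorded after it.
    cudaEventRecord(t0);
    atomic_heavy<<<(count + 255) / 256, 256>>>(d_bPtr, count); // assumed launch configuration
    cudaEventRecord(t1);
    cudaEventSynchronize(t1); // kernel has finished here

    // Time the device-to-host copy alone, now that no kernel is pending.
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    printf("kernel: %.3f ms\n", elapsed_ms(t0, t1));
    printf("copy:   %.3f ms\n", elapsed_ms(t1, t2));

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    return 0;
}
```

With the kernel and the copy measured in separate intervals, a slow kernel shows up in the first number rather than being attributed to thrust::copy.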

Answer 2

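I suggest you use cudpp, which in my opinion is faster than Thrust (I am writing my master's thesis on optimization and have tried both libraries). If the copy is very slow, you can try writing your own kernel to copy the data.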

CUDA device-to-host copy is very slow
