Question

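I am trying to time the Thrust sort function. Right now I am using CUDA events, but I am wondering whether CUDA events could be giving me wrong values. On my computer, Thrust sorts 2 million floats on the GPU in 34 ms, which seems too fast.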

I tried both CPU and GPU timing and got the following:

CPU (takes about 36 ms)

__int64 ctr1 = 0, ctr2 = 0, freq = 0;
QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
QueryPerformanceCounter((LARGE_INTEGER *)&ctr1);
thrust::sort(D.begin(), D.end());
// transfer data back to host
thrust::copy(D.begin(), D.end(), H.begin());
cudaDeviceSynchronize(); // block until all GPU work is finished (replaces the deprecated cudaThreadSynchronize)

QueryPerformanceCounter((LARGE_INTEGER *)&ctr2);
double ans = ((ctr2 - ctr1) * 1.0 / freq);
printf("The time elapsed in milliseconds is %f\n", (ans * 1000));

GPU

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
thrust::sort(D.begin(),D.end());

thrust::copy(D.begin(), D.end(), H.begin());
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime; 
cudaEventElapsedTime(&elapsedTime , start, stop);
printf("time is %f ms", elapsedTime);

Please tell me which time is correct.

Thanks

Answer 1

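Both times are correct, each from a different point of view. The CPU timing includes the overhead of the API calls and of synchronization. If that overhead is what you care about, use the CPU timer.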

Event-based timing isolates the work on the GPU and gives you the time the GPU spent executing.

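Another difference between CPU and event timing: if thrust::sort() is the first GPU call made from the current thread, that call has to set up the CUDA context, so a CPU timer will include the context creation. With event-based timing this does not happen, because the context is already set up when cudaEventCreate() is called.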
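The context-creation cost can be looked at in isolation with a small sketch. It assumes that cudaFree(0) is an acceptable way to force context creation (a common idiom, not something this answer prescribes), and the helper name wall_ms_of_cudaFree is made up for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of one cudaFree(0) call, in milliseconds.
double wall_ms_of_cudaFree() {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);  // frees nothing; the first runtime call also initializes the context
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    double first = wall_ms_of_cudaFree();   // includes context creation
    double second = wall_ms_of_cudaFree();  // context already exists
    printf("first call: %f ms, second call: %f ms\n", first, second);
    return 0;
}
```

If the first call is much slower than the second, a CPU timer wrapped around the first GPU call in a program would absorb that difference.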

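If you want to time a GPU algorithm to get performance figures, the best approach is event-based timing, with the algorithm run many times in a loop and the result averaged.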

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (int i = 0; i < 100; i++) {
    thrust::sort(D.begin(), D.end());

    thrust::copy(D.begin(), D.end(), H.begin());
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime; 
cudaEventElapsedTime(&elapsedTime , start, stop);
printf("Avg. time is %f ms", elapsedTime/100);
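Two caveats about the loop above: after the first iteration D is already sorted, which may not be representative, and the sort and the device-to-host copy are lumped into one number. The following is a sketch of one way to address both, not a definitive benchmark. It assumes random floats as input, and the names Timings and time_sort are invented for illustration; the 2 million element count comes from the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

struct Timings { float sort_ms; float copy_ms; };  // averages per iteration

// Hypothetical helper: average GPU time of thrust::sort and of the
// device-to-host copy, measured separately with CUDA events.
Timings time_sort(size_t n, int iterations) {
    thrust::host_vector<float> input(n), H(n);
    for (size_t i = 0; i < n; ++i)
        input[i] = static_cast<float>(rand()) / RAND_MAX;

    thrust::device_vector<float> D = input;
    thrust::sort(D.begin(), D.end());  // warm-up run, not timed

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    float sort_total = 0.0f, copy_total = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        D = input;  // restore unsorted data, outside the timed region

        cudaEventRecord(start, 0);
        thrust::sort(D.begin(), D.end());
        cudaEventRecord(mid, 0);
        thrust::copy(D.begin(), D.end(), H.begin());
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, mid);
        sort_total += ms;
        cudaEventElapsedTime(&ms, mid, stop);
        copy_total += ms;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(mid);
    cudaEventDestroy(stop);

    Timings t = { sort_total / iterations, copy_total / iterations };
    return t;
}

int main() {
    Timings t = time_sort(2000000, 100);
    printf("Avg. sort: %f ms, avg. copy: %f ms\n", t.sort_ms, t.copy_ms);
    return 0;
}
```

Reporting the two averages separately shows how much of the total is the sort itself and how much is the transfer back to the host.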

Answer 2

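Neither. I suggest you use the NVIDIA Visual Profiler that ships with the CUDA SDK. It will tell you the exact time taken by each operation on the GPU. More information about the tool is available on its page.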

Should we use CUDA events or a CPU timer to time Thrust functions (such as sort)?
