Question

Please see the following two snapshots, which show an Nvidia Visual Profiler session of my CUDA code:

Snapshot from nvprof session showing thrust::sort and thrust::reduce call execution timeline

Highlighted the sort and reduce calls to show the times taken and the gap in between their execution

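You can see a gap of roughly 70 µs between the two thrust::sort() calls, and then a large gap between the first thrust::reduce() call and the second thrust::sort() call. In total, about 300 µs of such gaps are visible in the snapshot. I believe these are 'idle' times, perhaps introduced by the thrust library. In any case, I could not find any relevant discussion or documentation on this from Nvidia. Can someone explain why I have such evident 'idle' times? Combined, they account for 40% of my application's execution time, so this is a big concern for me!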

In addition, I have measured the gap between consecutive calls to CUDA kernels that I wrote myself, and it is only 3 µs!

I have written a sample CUDA code so that I can post it here:

    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <cuda_profiler_api.h>
    #include <thrust/execution_policy.h>
    #include <thrust/reduce.h>
    #include <thrust/sort.h>

    void profileThrustSortAndReduce(const int ARR_SIZE)
    {
        // for thrust::reduce on first 10% of the sorted array
        const int ARR_SIZE_BY_10 = ARR_SIZE / 10;

        // generate host random arrays of float (pinned host memory)
        float* h_arr1;
        cudaMallocHost((void **)&h_arr1, ARR_SIZE * sizeof(float));
        float* h_arr2;
        cudaMallocHost((void **)&h_arr2, ARR_SIZE * sizeof(float));
        for (int i = 0; i < ARR_SIZE; i++) {
            h_arr1[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX) * 1000.0f;
            h_arr2[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX) * 1000.0f;
        }

        // device arrays populated
        float* d_arr1;
        cudaMalloc((void **)&d_arr1, ARR_SIZE * sizeof(float));
        float* d_arr2;
        cudaMalloc((void **)&d_arr2, ARR_SIZE * sizeof(float));
        cudaMemcpy(d_arr1, h_arr1, ARR_SIZE * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_arr2, h_arr2, ARR_SIZE * sizeof(float), cudaMemcpyHostToDevice);

        // start cuda profiler
        cudaProfilerStart();

        // sort the two device arrays
        thrust::sort(thrust::device, d_arr1, d_arr1 + ARR_SIZE);
        thrust::sort(thrust::device, d_arr2, d_arr2 + ARR_SIZE);

        // mean of 100 percentiles of device array
        float arr1_red_100pc_mean = thrust::reduce(thrust::device, d_arr1, d_arr1 + ARR_SIZE) / ARR_SIZE;
        // mean of smallest 10 percentiles of device array
        float arr1_red_10pc_mean = thrust::reduce(thrust::device, d_arr1, d_arr1 + ARR_SIZE_BY_10) / ARR_SIZE_BY_10;

        // mean of 100 percentiles of device array
        float arr2_red_100pc_mean = thrust::reduce(thrust::device, d_arr2, d_arr2 + ARR_SIZE) / ARR_SIZE;
        // mean of smallest 10 percentiles of device array
        float arr2_red_10pc_mean = thrust::reduce(thrust::device, d_arr2, d_arr2 + ARR_SIZE_BY_10) / ARR_SIZE_BY_10;

        // stop cuda profiler
        cudaProfilerStop();

        // release device and pinned host memory (missing from the original snippet)
        cudaFree(d_arr1);
        cudaFree(d_arr2);
        cudaFreeHost(h_arr1);
        cudaFreeHost(h_arr2);
    }

Snapshot of the nvprof session for this sample function

Answer 1

The gaps are mostly caused by cudaMalloc operations. thrust::sort, and probably thrust::reduce as well, allocate (and free) temporary storage associated with their activity.

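You have cropped this part of the timeline out of the first 2 pictures you pasted into the question, but immediately above the section of the timeline shown in the 3rd picture, you will find the cudaMalloc operations in the "Runtime API" profiler line.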

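These cudaMalloc (and cudaFree) operations are time-consuming and synchronizing. To work around this, the typical advice is to use a thrust custom allocator. That way, you can allocate the necessary size once, at the beginning of your program, and avoid incurring the allocate/free overhead on every thrust call.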
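To make the caching idea concrete, here is a minimal sketch of an allocator that keeps freed blocks and hands them back on the next request of the same size. It is not the code from the Thrust examples: the names CachedAllocator, HostBackend and backend_calls are made up for illustration, and the backend uses std::malloc/std::free so the sketch runs on the host. The assumption is that, on a GPU build, a backend wrapping cudaMalloc/cudaFree would take its place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <new>
#include <utility>

// Hypothetical backend so the sketch runs on the host. Assumption: on a GPU
// build, a backend calling cudaMalloc/cudaFree would be substituted here.
struct HostBackend {
    static void* alloc(std::size_t bytes) { return std::malloc(bytes); }
    static void release(void* p) { std::free(p); }
};

// Caching allocator: blocks passed to deallocate() are kept and reused for
// later requests of the same size, so the backend is hit only on a miss.
template <typename Backend>
class CachedAllocator {
public:
    typedef char value_type;  // temporary storage is requested in bytes

    CachedAllocator() : backend_calls_(0) {}
    CachedAllocator(const CachedAllocator&) = delete;
    CachedAllocator& operator=(const CachedAllocator&) = delete;

    ~CachedAllocator() {
        // Return every block, cached or still in use, to the backend.
        for (auto& kv : free_blocks_) Backend::release(kv.second);
        for (auto& kv : used_blocks_) Backend::release(kv.first);
    }

    char* allocate(std::ptrdiff_t num_bytes) {
        const std::size_t bytes = static_cast<std::size_t>(num_bytes);
        char* ptr = nullptr;
        auto hit = free_blocks_.find(bytes);
        if (hit != free_blocks_.end()) {
            // Cache hit: reuse a previously freed block of this exact size.
            ptr = hit->second;
            free_blocks_.erase(hit);
        } else {
            // Cache miss: this is the only path that reaches the backend.
            ptr = static_cast<char*>(Backend::alloc(bytes));
            if (ptr == nullptr) throw std::bad_alloc();
            ++backend_calls_;
        }
        used_blocks_[ptr] = bytes;
        return ptr;
    }

    void deallocate(char* ptr, std::size_t /*unused*/) {
        auto it = used_blocks_.find(ptr);
        assert(it != used_blocks_.end() && "pointer not owned by this allocator");
        if (it == used_blocks_.end()) return;
        // Keep the block for reuse instead of releasing it.
        free_blocks_.insert(std::make_pair(it->second, ptr));
        used_blocks_.erase(it);
    }

    int backend_calls() const { return backend_calls_; }

private:
    std::multimap<std::size_t, char*> free_blocks_;  // size -> cached block
    std::map<char*, std::size_t> used_blocks_;       // live block -> size
    int backend_calls_;
};
```

With a CUDA backend, an instance of such an allocator would typically be handed to Thrust through the execution policy, for example thrust::sort(thrust::cuda::par(alloc), d_arr1, d_arr1 + ARR_SIZE). After the first sort and reduce have warmed the cache, later calls with the same sizes should no longer trigger cudaMalloc/cudaFree.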

Alternatively, you could explore cub, which already has the allocation and processing steps separated for you.
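As a rough sketch of what that separation looks like, cub's device-wide algorithms are called twice: once with a null temporary-storage pointer to query the required size, and once with a real buffer to do the work. The helper names scratchBytesFor and sortAndSum are hypothetical, and the sketch assumes float keys and a scratch buffer that the caller allocates once with cudaMalloc and reuses for every call.

```cuda
#include <cstddef>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Size query only: with a null temp-storage pointer, cub writes the number of
// bytes it needs into the size argument and does no other work.
size_t scratchBytesFor(int n)
{
    size_t sort_bytes = 0;
    size_t reduce_bytes = 0;
    const float* in = nullptr;
    float* out = nullptr;
    cub::DeviceRadixSort::SortKeys(nullptr, sort_bytes, in, out, n);
    cub::DeviceReduce::Sum(nullptr, reduce_bytes, in, out, n);
    return sort_bytes > reduce_bytes ? sort_bytes : reduce_bytes;
}

// Processing only: sorts d_keys_in into d_keys_out, then sums the sorted keys
// into *d_sum (device memory). No allocation happens here, because the
// caller supplies d_scratch, sized at least scratchBytesFor(n).
cudaError_t sortAndSum(const float* d_keys_in, float* d_keys_out, float* d_sum,
                       int n, void* d_scratch, size_t scratch_bytes)
{
    size_t bytes = scratch_bytes;
    cudaError_t err =
        cub::DeviceRadixSort::SortKeys(d_scratch, bytes, d_keys_in, d_keys_out, n);
    if (err != cudaSuccess) return err;

    bytes = scratch_bytes;
    return cub::DeviceReduce::Sum(d_scratch, bytes, d_keys_out, d_sum, n);
}
```

Unlike thrust::reduce, which returns the value to the host, this leaves the sum in device memory. Reading it back still needs a cudaMemcpy, which synchronizes with the device.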

Why is there no activity on the GPU between consecutive thrust sort and reduce commands?
