I am writing K-means clustering in CUDA and I have run into a very strange problem. At the start of each clustering iteration I need to set a bool to false on the device and read it back afterwards. However, the memcpy of that single bool takes far too long compared to everything else; see the chart.
Yellow (topmost), ~3 ms: initial cudaMalloc and cudaMemcpy of all the data (about 7 MB of floats), six arrays in total.
Red, ~2 ms: copy of a single bool.
Blue, ~10 ms: the clustering itself.
Green, ~4 ms: parallel reduction, summing the per-thread results into one result (about 20 kernel launches on average).
Purple/brown, < 1 ms: cudaFree.
The code itself looks like this:
bool changed = false;
...
float start = sdkGetTimerValue(&stopWatch);
cudaMemcpy(dev_changed, &changed, sizeof(bool), cudaMemcpyHostToDevice);
float end = sdkGetTimerValue(&stopWatch);
...
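The bool itself is just a plain stack variable, so the copy goes from ordinary pageable host memory. In case it is relevant, a page-locked variant I could switch to would look roughly like this (untested sketch; h_changed is a hypothetical name, dev_changed is the same device pointer as above):
// Untested sketch: keep the host-side flag in page-locked (pinned) memory so the
// tiny host<->device transfers avoid the pageable staging buffer.
bool* h_changed = nullptr;
cudaMallocHost((void**) &h_changed, sizeof(bool));  // pinned host allocation
*h_changed = false;
cudaMemcpy(dev_changed, h_changed, sizeof(bool), cudaMemcpyHostToDevice);
// ... kernel launches ...
cudaMemcpy(h_changed, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);
cudaFreeHost(h_changed);                            // release the pinned buffer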
For the time measurement I use TimerHelper.h from the CUDA samples, which internally uses QueryPerformanceCounter. I re-checked the measurement about ten times because I could not believe it. If I take the cudaMemcpy out, that portion drops to nearly zero (instead of ~2 ms).
What strikes me as odd is that when I compare the bool copy time (red) with the parallel reduction (green), it is a single memcpy versus roughly 20 kernel launches.
So I tried writing a kernel that just sets the bool to false on the device:
__global__ void SetToFalse(bool* boolean) {
boolean[0] = false;
}
void LaunchSetToFalse(bool* boolean) {
SetToFalse<<<1, 1>>>(boolean);
}
and changed the code to:
...
float start = sdkGetTimerValue(&stopWatch);
LaunchSetToFalse(dev_changed);
cudaDeviceSynchronize();
float end = sdkGetTimerValue(&stopWatch);
...
But it still takes about 2 ms (no change).
Am I missing something obvious? What makes the bool copy so slow? Also, the green block includes a device-to-host copy of the bool, and that copy accounts for roughly half of the green block. I checked the time measurement very carefully and there should not be any mistake in it; still, the results seem far too strange.
Is there a better way to report a bool from the threads? Thanks in advance for any suggestions.
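For reference, the kind of alternative I have in mind (untested sketch) is to clear the device-side flag in place with cudaMemset instead of copying a host bool over, and to read it back only once per iteration:
// Untested sketch: reset the device flag with cudaMemset (false == 0) instead of a
// host-to-device copy, then read it back once after the clustering kernel finishes.
cudaMemset(dev_changed, 0, sizeof(bool));
// ... launch the clustering kernel, which (presumably) writes true to *dev_changed
//     whenever a point switches clusters ...
bool changedThisIteration = false;
cudaMemcpy(&changedThisIteration, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);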
EDIT: Here is a larger portion of the code:
cudaMalloc((void**) &dev_data, data.getRawSize() * sizeof(float));
cudaMalloc((void**) &dev_linearizedClusterCenters, clustersCount * size_dim * sizeof(float));
cudaMalloc((void**) &dev_outClusters, rowsCount * sizeof(byte));
cudaMalloc((void**) &dev_changed, sizeof(bool));
cudaMalloc((void**) &dev_count, numOfThreadBlocks * clustersCount * sizeof(int));
cudaMalloc((void**) &dev_newData, numOfThreadBlocks * clustersCount * size_dim * sizeof(float));
cudaMemcpy(dev_data, data.getDataPointer() , data.getRawSize() * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_outClusters, outClusters , rowsCount * sizeof(byte), cudaMemcpyHostToDevice);
cudaMemcpy(dev_linearizedClusterCenters, inOutLinearizedClusterCenters , clustersCount * size_dim * sizeof(float), cudaMemcpyHostToDevice);
perfRecorder.endInit();
for (int global_count = 0; (global_count < maxIterations) && changed[0]; global_count++) {
    perfRecorder.startIteration();
    /* <<<<<<<<<< THIS GUY TAKES WAY TOO LONG >>>>>>>>>>>>>> */
    changed[0] = false;
    cudaMemcpy(dev_changed, changed, sizeof(bool), cudaMemcpyHostToDevice);
    //LaunchSetToFalse(dev_changed);
    //cudaDeviceSynchronize();
    perfRecorder.startIterationCompute();
    LaunchKernel(dev_data, dev_linearizedClusterCenters, dev_outClusters, dev_changed, dev_count, dev_newData, clustersCount, size_dim, rowsCount, numOfThreads, numOfBlocks);
    cudaDeviceSynchronize();
    perfRecorder.endIterationCompute();
    cudaMemcpy(changed, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);
    LaunchParallelSummationInt(dev_count, numOfThreadBlocks * clustersCount, numOfThreads,
            parallelReductionIterations);
    LaunchParallelSummationFloat(dev_newData, numOfThreadBlocks * clustersCount * size_dim, numOfThreads,
            parallelReductionIterations);
    LaunchCountNewClusterCenters(dev_newData, dev_count, clustersCount, size_dim, dev_linearizedClusterCenters, numOfThreads);
    cudaDeviceSynchronize(); // <<<<<<< EDIT 2
    perfRecorder.endIteration();
} // End of for
perfRecorder.startCleanup();
cudaMemcpy(outClusters, dev_outClusters, rowsCount * sizeof(byte), cudaMemcpyDeviceToHost);
cudaFree(dev_data);
cudaFree(dev_linearizedClusterCenters);
cudaFree(dev_outClusters);
cudaFree(dev_changed);
cudaFree(dev_count);
cudaFree(dev_newData);
perfRecorder.endCleanup();
EDIT 2: As @Robert Crovella correctly suggested, I placed a synchronization before perfRecorder.endIteration();. The chart looks better now, but the 1-byte transfer still takes quite a long time (~1 ms):
EDIT 3: I am on Windows. The timing code just stores the millisecond values as floats and subtracts them later; I am ignoring the stopwatch's own latency.
/****** Code from NVIDIA sample code in TimeHelper.h ******/
inline float StopWatchWin::getTime() {
// Return the TOTAL time to date
float retval = total_time;
if (running) {
LARGE_INTEGER temp;
QueryPerformanceCounter((LARGE_INTEGER *) &temp);
retval += (float)(((double)(temp.QuadPart - start_time.QuadPart)) / freq);
}
return retval;
}
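As a cross-check of the host-side stopwatch, I could also time the transfer with CUDA events; a minimal sketch (not my current code, reusing dev_changed from above) would be:
// Untested sketch: time the single-bool copy with CUDA events as a cross-check of the
// QueryPerformanceCounter-based stopwatch; cudaEventElapsedTime reports milliseconds.
bool hostFlag = false;
cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);
cudaEventRecord(evStart);
cudaMemcpy(dev_changed, &hostFlag, sizeof(bool), cudaMemcpyHostToDevice);
cudaEventRecord(evStop);
cudaEventSynchronize(evStop);
float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, evStart, evStop);
cudaEventDestroy(evStart);
cudaEventDestroy(evStop);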