cudaMemcpy of a single bool takes too long

Time: 2013-12-02 23:38:33

Tags: cuda

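I am writing K-means clustering in CUDA and have run into a very strange problem. At the start of each clustering iteration I need to set a bool to false on the device and later read it back. However, compared with everything else, the memcpy of that bool takes far too long; see the chart.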

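Yellow (topmost), ~3 ms: the initial cudaMalloc and cudaMemcpy of all the data (about 7 MB of floats), 6 arrays in total.

Red, ~2 ms: the copy of a single bool

Blue, ~10 ms: the clustering itself

Green, ~4 ms: parallel reduction, which sums the per-thread results into a single result (about 20 kernel calls on average)

Purple/brown, < 1 ms: cudaFree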

Graph of times

The code itself is as follows:

bool changed = false;
...
float start = sdkGetTimerValue(&stopWatch);
cudaMemcpy(dev_changed, &changed, sizeof(bool), cudaMemcpyHostToDevice);
float end = sdkGetTimerValue(&stopWatch);
...

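For timing I use TimerHelper.h from the CUDA samples, which uses QueryPerformanceCounter internally. I repeated the measurement about 10 times because I could not believe it. If I take the cudaMemcpy out, the measured time drops to nearly 0 (instead of about 2 ms).

The strange part is the comparison between the bool copy (red) and the parallel reduction (green): it is 1 memcpy versus 20 kernel calls.

So I tried writing a kernel that only sets the bool to false: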

__global__ void SetToFalse(bool* boolean) {
    boolean[0] = false;
}

void LaunchSetToFalse(bool* boolean) {
    SetToFalse<<<1, 1>>>(boolean);
}

and then changed the code to:

...
float start = sdkGetTimerValue(&stopWatch);
LaunchSetToFalse(dev_changed);
cudaDeviceSynchronize();
float end = sdkGetTimerValue(&stopWatch);
...

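But it still takes about 2 ms (no change).

Am I missing something obvious? What makes the bool copy so slow? Also, the green block includes copying the bool from the GPU back to the CPU, and that copy accounts for about half of the green block. I checked the timing very carefully and there should not be any mistake; still, the results are too strange.

Is there a better way to report a bool from the threads? Thanks in advance for any suggestions.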
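To make the question concrete, here is a minimal sketch of one alternative that could be tried: resetting the flag with cudaMemset instead of a host-to-device copy, and reading it back into pinned host memory. The kernel MarkChanged and the helper RunIterationAndReadFlag are hypothetical stand-ins, not part of the real clustering code, and whether this is actually faster on a given system would have to be measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the clustering kernel: a thread that reassigns
// a point sets the flag. Here exactly one thread pretends to do so.
__global__ void MarkChanged(bool* changed, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == n - 1) {
        *changed = true;  // every writer stores the same value
    }
}

// Hypothetical helper: resets the flag, runs the kernel, reads the flag back.
bool RunIterationAndReadFlag(bool* dev_changed, bool* host_changed,
                             int blocks, int threads) {
    // Reset on the device; no host value has to be staged for the copy.
    cudaMemset(dev_changed, 0, sizeof(bool));

    MarkChanged<<<blocks, threads>>>(dev_changed, blocks * threads);

    // Blocking copy on the default stream: it is ordered after the kernel
    // and returns only once the flag has arrived on the host.
    cudaMemcpy(host_changed, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);
    return *host_changed;
}

int main() {
    bool* dev_changed = nullptr;
    bool* host_changed = nullptr;
    cudaMalloc((void**)&dev_changed, sizeof(bool));
    // Assumption: pinned (page-locked) host memory for the read-back target.
    cudaMallocHost((void**)&host_changed, sizeof(bool));

    bool changed = RunIterationAndReadFlag(dev_changed, host_changed, 4, 256);
    printf("changed = %d\n", changed ? 1 : 0);

    cudaFreeHost(host_changed);
    cudaFree(dev_changed);
    return 0;
}
```

In the loop from the question, the cudaMemset call would take the place of the host-to-device cudaMemcpy at the top of each iteration.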

Edit: here is a larger part of the code:

cudaMalloc((void**) &dev_data, data.getRawSize() * sizeof(float));
cudaMalloc((void**) &dev_linearizedClusterCenters, clustersCount * size_dim * sizeof(float)); 
cudaMalloc((void**) &dev_outClusters, rowsCount * sizeof(byte));

cudaMalloc((void**) &dev_changed, sizeof(bool));
cudaMalloc((void**) &dev_count, numOfThreadBlocks * clustersCount * sizeof(int));
cudaMalloc((void**) &dev_newData, numOfThreadBlocks * clustersCount * size_dim * sizeof(float));

cudaMemcpy(dev_data, data.getDataPointer() , data.getRawSize() * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_outClusters, outClusters , rowsCount * sizeof(byte), cudaMemcpyHostToDevice);
cudaMemcpy(dev_linearizedClusterCenters, inOutLinearizedClusterCenters , clustersCount * size_dim  * sizeof(float), cudaMemcpyHostToDevice);

perfRecorder.endInit();
for (int global_count = 0; (global_count < maxIterations) && changed[0]; global_count++) {
    perfRecorder.startIteration();
    /* <<<<<<<<<< THIS GUY TAKES WAY TOO LONG >>>>>>>>>>>>>> */
    changed[0] = false;
    cudaMemcpy(dev_changed, changed, sizeof(bool), cudaMemcpyHostToDevice);
    //LaunchSetToFalse(dev_changed);
    //cudaDeviceSynchronize();

    perfRecorder.startIterationCompute();
    LaunchKernel(dev_data, dev_linearizedClusterCenters, dev_outClusters, dev_changed, dev_count, dev_newData, clustersCount, size_dim, rowsCount, numOfThreads, numOfBlocks);
    cudaDeviceSynchronize();

    perfRecorder.endIterationCompute();
    cudaMemcpy(changed, dev_changed, sizeof(bool), cudaMemcpyDeviceToHost);

    LaunchParallelSummationInt(dev_count, numOfThreadBlocks * clustersCount, numOfThreads,
        parallelReductionIterations);
    LaunchParallelSummationFloat(dev_newData, numOfThreadBlocks * clustersCount * size_dim, numOfThreads,
        parallelReductionIterations);
    LaunchCountNewClusterCenters(dev_newData, dev_count, clustersCount, size_dim, dev_linearizedClusterCenters, numOfThreads);

    cudaDeviceSynchronize();  // <<<<<<< EDIT 2
    perfRecorder.endIteration();
}// End of for

perfRecorder.startCleanup();
cudaMemcpy(outClusters, dev_outClusters, rowsCount * sizeof(byte), cudaMemcpyDeviceToHost);

cudaFree(dev_data);
cudaFree(dev_linearizedClusterCenters);
cudaFree(dev_outClusters);

cudaFree(dev_changed);
cudaFree(dev_count);
cudaFree(dev_newData);
perfRecorder.endCleanup();

Edit 2: As @Robert Crovella correctly suggested, I added a synchronization before perfRecorder.endIteration();. The graph looks better, but the 1-byte transfer still takes quite a long time (~1 ms):

Bench after sync

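Edit 3: I am on Windows, and the timing code simply stores the elapsed milliseconds as floats and subtracts them. I ignore the latency of the stopwatch itself.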

/****** Code from NVIDIA sample code in TimeHelper.h ******/
inline float StopWatchWin::getTime() {
    // Return the TOTAL time to date
    float retval = total_time;
    if (running) {
        LARGE_INTEGER temp;
        QueryPerformanceCounter((LARGE_INTEGER *) &temp);
        retval += (float)(((double)(temp.QuadPart - start_time.QuadPart)) / freq);
    }
    return retval;
}
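As a cross-check on the host-side stopwatch, the same copy could also be timed with CUDA events. The sketch below assumes nothing about the real project beyond the 1-byte flag; the helper name TimeFlagCopyMs is hypothetical. It synchronizes first so that previously queued GPU work is not attributed to the copy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the event-measured time of a 1-byte
// host-to-device copy, in milliseconds.
float TimeFlagCopyMs(bool* dev_changed, const bool* host_flag) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Drain earlier kernels and copies so they are not billed to this copy.
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    cudaMemcpy(dev_changed, host_flag, sizeof(bool), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    bool changed = false;
    bool* dev_changed = nullptr;
    cudaMalloc((void**)&dev_changed, sizeof(bool));

    printf("1-byte H2D copy: %.3f ms\n", TimeFlagCopyMs(dev_changed, &changed));

    cudaFree(dev_changed);
    return 0;
}
```

If the event-based figure came out much smaller than the stopwatch figure, that would suggest the extra time is spent on the host side of the call rather than in the transfer itself.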

0 answers:

No answers