Floating-point min/max in CUDA is slower than the CPU version. Why?

Asked: 2017-06-09 03:41:40

Tags: performance cuda

I wrote a kernel that computes the minimum and maximum of an array of roughly 100,000 floats using a reduction (see the code below). I use thread blocks to reduce chunks of 1024 values down to a single value each (in shared memory), then do the final reduction across blocks on the CPU.

I then compared this against a serial computation on the CPU. The CUDA version takes 2.2 ms; the CPU version takes 0.21 ms. Why is the CUDA version so much slower? Is the array too small to benefit from parallelism, or is my code just poorly optimized?

This is part of an exercise from the Udacity Parallel Programming course. I run it through their site, so I don't know what the exact hardware is, but they claim the code runs on an actual GPU.

Here is the CUDA code:

__global__ void min_max_kernel(const float* const d_logLuminance,
                            const size_t length,
                            float* d_min_logLum,
                            float* d_max_logLum) {
    // Shared working memory
    extern __shared__ float sh_logLuminance[];

    int blockWidth = blockDim.x;
    int x = blockDim.x * blockIdx.x + threadIdx.x;

    float* min_logLuminance = sh_logLuminance;
    float* max_logLuminance = sh_logLuminance + blockWidth;

    // Copy this block's chunk of the data to shared memory
    // We copy twice so we compute min and max at the same time
    if (x < length) {
        min_logLuminance[threadIdx.x] = d_logLuminance[x];
        max_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x];
    }
    else {
        // Pad if we're out of range
        min_logLuminance[threadIdx.x] = FLT_MAX;
        max_logLuminance[threadIdx.x] = -FLT_MAX;
    }

    __syncthreads();

    // Reduce
    for (int s = blockWidth/2; s > 0; s /= 2) {
        if (threadIdx.x < s) {
            if (min_logLuminance[threadIdx.x + s] < min_logLuminance[threadIdx.x]) {
                min_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x + s];
            }

            if (max_logLuminance[threadIdx.x + s] > max_logLuminance[threadIdx.x]) {
                max_logLuminance[threadIdx.x] = max_logLuminance[threadIdx.x + s];
            }
        }

        __syncthreads();
    }

    // Write to global memory
    if (threadIdx.x == 0) {
        d_min_logLum[blockIdx.x] = min_logLuminance[0];
        d_max_logLum[blockIdx.x] = max_logLuminance[0];
    }
}

size_t get_num_blocks(size_t inputLength, size_t threadsPerBlock) {
    return inputLength / threadsPerBlock +
        ((inputLength % threadsPerBlock == 0) ? 0 : 1);
}

/*
* Compute min, max over the data by first reducing on the device, then
* doing the final reduction on the host.
*/
void compute_min_max(const float* const d_logLuminance,
                    float& min_logLum,
                    float& max_logLum,
                    const size_t numRows,
                    const size_t numCols) {
    // Compute min, max
    printf("\n=== computing min/max ===\n");
    const size_t blockWidth = 1024;
    const size_t numPixels = numRows * numCols;
    size_t numBlocks = get_num_blocks(numPixels, blockWidth);

    printf("Num min/max blocks = %zu\n", numBlocks);

    float* d_min_logLum;
    float* d_max_logLum;
    int alloc_size = sizeof(float) * numBlocks;
    checkCudaErrors(cudaMalloc(&d_min_logLum, alloc_size));
    checkCudaErrors(cudaMalloc(&d_max_logLum, alloc_size));

    min_max_kernel<<<numBlocks, blockWidth, sizeof(float) * blockWidth * 2>>>
        (d_logLuminance, numPixels, d_min_logLum, d_max_logLum);

    float* h_min_logLum = (float*) malloc(alloc_size);
    float* h_max_logLum = (float*) malloc(alloc_size);
    checkCudaErrors(cudaMemcpy(h_min_logLum, d_min_logLum, alloc_size, cudaMemcpyDeviceToHost));
    checkCudaErrors(cudaMemcpy(h_max_logLum, d_max_logLum, alloc_size, cudaMemcpyDeviceToHost));

    min_logLum = FLT_MAX;
    max_logLum = -FLT_MAX;

    // Reduce over the block results
    // (would be a bit faster to do it on the GPU, but it's just 96 numbers)
    for (int i = 0; i < numBlocks; i++) {
        if (h_min_logLum[i] < min_logLum) {
            min_logLum = h_min_logLum[i];
        }
        if (h_max_logLum[i] > max_logLum) {
            max_logLum = h_max_logLum[i];
        }
    }

    printf("min_logLum = %.2f\nmax_logLum = %.2f\n", min_logLum, max_logLum);

    checkCudaErrors(cudaFree(d_min_logLum));
    checkCudaErrors(cudaFree(d_max_logLum));
    free(h_min_logLum);
    free(h_max_logLum);
}

Here is the host version:

void compute_min_max_on_host(const float* const d_logLuminance, size_t numPixels) {
    int alloc_size = sizeof(float) * numPixels;
    float* h_logLuminance = (float*) malloc(alloc_size);
    checkCudaErrors(cudaMemcpy(h_logLuminance, d_logLuminance, alloc_size, cudaMemcpyDeviceToHost));
    float host_min_logLum = FLT_MAX;
    float host_max_logLum = -FLT_MAX;
    printf("HOST ");
    for (int i = 0; i < numPixels; i++) {
        if (h_logLuminance[i] < host_min_logLum) {
            host_min_logLum = h_logLuminance[i];
        }
        if (h_logLuminance[i] > host_max_logLum) {
            host_max_logLum = h_logLuminance[i];
        }
    }
    printf("host_min_logLum = %.2f\nhost_max_logLum = %.2f\n",
        host_min_logLum, host_max_logLum);
    free(h_logLuminance);
}

1 Answer:

Answer 0 (score: 2):

  1. As @talonmies says, behavior may well differ at larger sizes; 100,000 elements really isn't much: most of it fits in the combined L1 caches of a modern CPU's cores, and half of it fits in a single core's L2 cache.
  2. Transfers over PCI Express take time; in your case it may be taking about twice as long as necessary, since you are not using pinned memory.
  3. You are not overlapping computation with PCI Express I/O (not that it would make much sense for only 100,000 elements).
  4. Your kernel is rather slow, for more than one reason; not least of these is the extensive use of shared memory, most of which is unnecessary.
  5. More generally: always profile your code with nvvp (or nvprof) to get actual numbers for further analysis.
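To illustrate point 2, here is a minimal sketch of what pinned staging buffers might look like inside `compute_min_max`, assuming the same `alloc_size` and `checkCudaErrors` helper as in the question (`cudaMallocHost` is the runtime API for page-locked allocation):

```
// Sketch only: replace the malloc'd staging buffers in compute_min_max
// with page-locked (pinned) host memory, so device-to-host copies can
// skip the driver's internal staging copy.
float* h_min_logLum;
float* h_max_logLum;
checkCudaErrors(cudaMallocHost(&h_min_logLum, alloc_size));
checkCudaErrors(cudaMallocHost(&h_max_logLum, alloc_size));

// ... kernel launch and cudaMemcpy calls unchanged ...

checkCudaErrors(cudaFreeHost(h_min_logLum));
checkCudaErrors(cudaFreeHost(h_max_logLum));
```

For the 96 block results the saving is negligible; it matters most for large transfers such as the full-array copy in `compute_min_max_on_host`.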
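On point 4, a common way to cut the shared-memory traffic is to have each thread first reduce many elements in registers via a grid-stride loop, so shared memory is written only once per thread before the tree reduction (instead of being read and written on every step for both arrays). A hedged sketch, not the course's reference solution: `min_max_kernel_v2` is a hypothetical name, it assumes the same includes as the original file (`<cfloat>` for `FLT_MAX`), and it would be launched with a fixed, modest grid (e.g. 64 blocks) and the same `2 * blockDim.x * sizeof(float)` shared allocation:

```
__global__ void min_max_kernel_v2(const float* const d_in,
                                  const size_t length,
                                  float* d_min,
                                  float* d_max) {
    extern __shared__ float sh[];      // blockDim.x mins, then blockDim.x maxes
    float* sh_min = sh;
    float* sh_max = sh + blockDim.x;

    // Phase 1: each thread reduces many elements in registers
    float tmin = FLT_MAX;
    float tmax = -FLT_MAX;
    for (size_t i = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
         i < length;
         i += (size_t)blockDim.x * gridDim.x) {
        float v = d_in[i];
        tmin = fminf(tmin, v);
        tmax = fmaxf(tmax, v);
    }
    sh_min[threadIdx.x] = tmin;        // single shared-memory write per thread
    sh_max[threadIdx.x] = tmax;
    __syncthreads();

    // Phase 2: tree reduction over the (now much smaller) shared arrays
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sh_min[threadIdx.x] = fminf(sh_min[threadIdx.x], sh_min[threadIdx.x + s]);
            sh_max[threadIdx.x] = fmaxf(sh_max[threadIdx.x], sh_max[threadIdx.x + s]);
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        d_min[blockIdx.x] = sh_min[0];
        d_max[blockIdx.x] = sh_max[0];
    }
}
```

This also shrinks the number of block results the host has to reduce, and `fminf`/`fmaxf` avoid the branchy compare-and-swap in the original kernel.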