Question

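I implemented an image mean filter in two versions, a serial CPU version and a parallel NVIDIA GPU version, and measured their running times (see the results of the test cases and the specs of the devices). Why does case 2 have the highest speedup and case 3 the lowest?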

GPU execution configuration

        int block_size = 32;
        int grid_size = width/block_size; //width of the image in pixels
        dim3 dimBlock(block_size, block_size, 1);
        dim3 dimGrid(grid_size, grid_size, 1);

Time measurement for the GPU code

        clock_t start_d=clock();
        meanFilter_d <<< dimGrid, dimBlock >>> (image_data_d, result_image_data_d, width, height, half_window);
        cudaDeviceSynchronize(); // replaces the deprecated cudaThreadSynchronize()
        clock_t end_d=clock();

Time measurement for the CPU code (single-threaded)

        clock_t start_h = clock();
        meanFilter_h(data, result_image_data_h1, width, height, window_size);
        clock_t end_h = clock();

Host code

void meanFilter_h(unsigned char* raw_image_matrix,unsigned char* filtered_image_data,int image_width, int image_height, int window_size)
{
    // int size = 3 * image_width * image_height;
    int half_window = (window_size-window_size % 2)/2;
    for(int i = 0; i < image_height; i += 1){
        for(int j = 0; j < image_width; j += 1){
            int k = 3*(i*image_width+j); // row-major index: the row stride is the image width
            int top, bottom, left, right;
            if(i-half_window >= 0){top = i-half_window;}else{top = 0;}// top limit
            if(i+half_window <= image_height-1){bottom = i+half_window;}else{bottom = image_height-1;}// bottom limit
            if(j-half_window >= 0){left = j-half_window;}else{left = 0;}// left limit
            if(j+half_window <= image_width-1){right = j+half_window;}else{right = image_width-1;}// right limit
            double first_byte = 0;
            double second_byte = 0;
            double third_byte = 0;
            // move inside the window
            for(int x = top; x <= bottom; x++){
                for(int y = left; y <= right; y++){
                    int pos = 3*(x*image_width + y); // three bytes per pixel, row stride is the width
                    first_byte += raw_image_matrix[pos];
                    second_byte += raw_image_matrix[pos+1];
                    third_byte += raw_image_matrix[pos+2];
                }
            }
            int effective_window_size = (bottom-top+1)*(right-left+1);
            filtered_image_data[k] = first_byte/effective_window_size;
            filtered_image_data[k+1] = second_byte/effective_window_size;
            filtered_image_data[k+2] =third_byte/effective_window_size;


        }
    }
}

Device code

__global__ void meanFilter_d(unsigned char* raw_image_matrix, unsigned char* filtered_image_data, int image_width, int image_height, int half_window)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < image_height && j < image_width){
        int k = 3*(i*image_width+j); // row-major index: the row stride is the image width
        int top, bottom, left, right;
        if(i-half_window >= 0){top = i-half_window;}else{top = 0;}// top limit
        if(i+half_window <= image_height-1){bottom = i+half_window;}else{bottom = image_height-1;}// bottom limit
        if(j-half_window >= 0){left = j-half_window;}else{left = 0;}// left limit
        if(j+half_window <= image_width-1){right = j+half_window;}else{right = image_width-1;}// right limit
        double first_byte = 0;
        double second_byte = 0;
        double third_byte = 0;
        // move inside the window
        for(int x = top; x <= bottom; x++){
            for(int y = left; y <= right; y++){
                int pos = 3*(x*image_width + y); // three bytes per pixel, row stride is the width
                first_byte += raw_image_matrix[pos];
                second_byte += raw_image_matrix[pos+1];
                third_byte += raw_image_matrix[pos+2];
            }
        }
        int effective_window_size = (bottom-top+1)*(right-left+1);
        filtered_image_data[k] = first_byte/effective_window_size;
        filtered_image_data[k+1] = second_byte/effective_window_size;
        filtered_image_data[k+2] =third_byte/effective_window_size;
    }
}

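As can be seen, both image sizes are slower with the 3×3 kernel than with the 5×5 kernel. Case 1 has more parallelism than case 3 because of the larger image size, so the utilization of the device is higher for case 1 than for case 3. But I have no idea how to explain this further. Please give me some insight.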

Answer 1

First of all: what are you measuring and, above all, how? From your question it is impossible to infer either, and the how in particular.

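In any case, I strongly suggest you take a look at this very simple and useful article by Mark Harris, which explains some good practices for timing device-side code (e.g. CUDA memory transfers, kernels, etc.).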

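By the way, trying to obtain a CPU/GPU speedup figure is a very tricky topic, because the two architectures are so different in nature. Even if your CPU and GPU codes are apparently doing the same thing, there are many factors you may want to consider (e.g. CPU cores, GPU streaming multiprocessors, and cores per SM). Here Robert Crovella gives a great answer to a similar question; as he says: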

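If you make any claims about "the GPU is XX faster than the CPU", then IMO you are well advised to compare only codes that do the same work and make efficient and effective use of the underlying architectures (for both CPU and GPU). For example, in the CPU case you should certainly be using multi-threaded code, so as to take advantage of the multiple CPU cores that most modern CPUs offer. Such claims are likely to be viewed with skepticism anyway, so it is probably best to avoid them unless they are the crux of your intent.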

I suggest you also take a look at this discussion.

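After these premises, I think you cannot consider those speedups reliable (in fact, they look a bit strange to me).
Trying to interpret what you meant to say: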

As can be seen, both image sizes are slower with the 3×3 kernel than with the 5×5 kernel

Maybe you meant that with the 3x3 window the speedup is smaller than with the 5x5 window. Try to be more precise.

Why does case 2 have the highest speedup and case 3 the lowest?

It is hard to infer anything from the little information you have provided.

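Please add: some code, so we can see what you are doing and how you implemented the problem in the device and host cases, and a description of how and what you are measuring.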

EDIT:

Well, I think you should take your measurements in a more accurate way.

First, I suggest you use a more accurate alternative to clock(). Looking at the answer here and at the C++ reference, I suggest you consider using

std::chrono::system_clock::now()

std::chrono::high_resolution_clock::now();

Then, I repeat, read the article by Mark Harris (linked above). There he says

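A problem with using host-device synchronization points, such as cudaDeviceSynchronize(), is that they stall the GPU pipeline. For this reason, CUDA offers a relatively light-weight alternative to CPU timers via the CUDA event API. The CUDA event API includes calls to create and destroy events, record events, and compute the elapsed time in milliseconds between two recorded events.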

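This means that the measurements you obtained using cudaDeviceSynchronize() may be somewhat "distorted". Moreover, if you use a plain cudaMemcpy, you do not need a synchronization mechanism, because it is a synchronous call.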

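Also consider including the H2D / D2H transfers in the measurement. In my opinion it is important to account for this overhead in a CPU/GPU comparison (but the choice is up to you).
As for the measurements in your picture, are they direct results, or averages over repeated executions (possibly with outliers discarded)?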
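
To make the averaging question concrete, here is a small sketch. `mean_after_warmup` is a hypothetical helper, and dropping the first few runs as warm-up is only one assumed policy; a median or a trimmed mean would be equally defensible.

```cpp
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helper: averages per-run timings (in ms) after dropping the
// first `warmup` samples, which often include one-off costs such as
// CUDA context initialization.
double mean_after_warmup(const std::vector<double>& samples_ms, std::size_t warmup) {
    if (warmup >= samples_ms.size()) {
        throw std::invalid_argument("need at least one sample after warm-up");
    }
    const auto first = samples_ms.begin() + static_cast<std::ptrdiff_t>(warmup);
    const double sum = std::accumulate(first, samples_ms.end(), 0.0);
    return sum / static_cast<double>(samples_ms.size() - warmup);
}
```

Reporting the number of runs and the discard policy next to each timing makes the resulting speedups much easier to interpret.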

I think you should take new measurements following the suggestions above, and then reconsider your conclusions in light of them.

You said

Case 1 has more parallelism than case 3 because of the larger image size, so the utilization of the device is higher for case 1 than for case 3.

I disagree, since you have int grid_size = width/block_size;

Case 1: grid_size = 640/32 = 20

Case 2: grid_size = 1280/32 = 40

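So in case 2 you have more parallelism. However, since you only have 2 SMs, this may be why the time is longer than you expect. In other words, you have more blocks (40 * 40) waiting to be computed by the two SMs.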
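
The block counts behind this comparison can be worked through in a few lines. This is only a sketch: `LaunchShape` and `launch_shape` are hypothetical names, the square grid mirrors the configuration in the question, and the even split of blocks across SMs is an assumption for illustration, not a model of the hardware scheduler.

```cpp
// Hypothetical summary of a square launch like the one in the question.
struct LaunchShape {
    int grid_size;      // blocks per grid dimension
    int total_blocks;   // grid_size * grid_size
    int blocks_per_sm;  // even share per SM (illustrative assumption)
};

// Mirrors `int grid_size = width/block_size;` from the question.
// Integer division truncates, so a width that is not a multiple of
// block_size would leave the rightmost pixels without a thread.
inline LaunchShape launch_shape(int width, int block_size, int num_sms) {
    const int grid = width / block_size;
    const int total = grid * grid;
    return LaunchShape{grid, total, total / num_sms};
}

// Case 1: width 640  -> 20 x 20 = 400 blocks,  200 per SM with 2 SMs.
// Case 2: width 1280 -> 40 x 40 = 1600 blocks, 800 per SM with 2 SMs.
```

Each block in that configuration holds 32 * 32 = 1024 threads, so case 2 launches four times as many threads as case 1 onto the same two SMs.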
