Averaging an image on CUDA with atomicAdd produces inconsistent results

Date: 2017-10-10 14:58:28

Tags: image-processing graphics parallel-processing cuda computer-vision

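I am averaging the intensities of an image as a simple test case for a larger problem. However, every time I run it I get slightly different results. In contrast, if I run the same algorithm sequentially on the CPU, the result never changes. Here is the GPU code: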

//util.cu
__global__ void avgImageDevice(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height)
{
    const unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    atomicAdd(&avg->x, d_colorImageRGB[y*width + x].x);
    atomicAdd(&avg->y, d_colorImageRGB[y*width + x].y);
    atomicAdd(&avg->z, d_colorImageRGB[y*width + x].z);
}

extern "C" void avgImage(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height)
{
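    const int T_PER_BLOCK = 16;
    // Grid: enough 16x16 blocks to cover the whole image. Block: 16x16 threads.
    // (The original names were swapped; the launch configuration itself is unchanged.)
    const dim3 gridSize((width + T_PER_BLOCK - 1) / T_PER_BLOCK, (height + T_PER_BLOCK - 1) / T_PER_BLOCK);
    const dim3 blockSize(T_PER_BLOCK, T_PER_BLOCK);

    avgImageDevice<<<gridSize, blockSize>>>(avg, d_colorImageRGB, width, height);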
}

The host code, including the CPU implementation, is as follows:

//main.cpp

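#include <cstdio>
#include <string>
#include <vector_types.h>
#include <cuda_runtime.h>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp> // imread
// cutilSafeCall and the float3 operators (+=, /=) are not part of the CUDA runtime;
// they presumably come from CUDA SDK utility headers (e.g. cutil_inline.h, cutil_math.h)
// that are not shown in the question.
using namespace cv;

extern "C" void avgImage(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height);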

int main()
{
    for(int k = 0 ; k < 100 ;++k)
    {
        //Initialization
        Mat Image;
        float3 avgCPU = make_float3(0, 0, 0);
        float3 avgGPU = make_float3(0, 0, 0);
        std::string filenameImage("/foo.jpg");
        Image = imread(filenameImage, -1);
        Image.convertTo(Image, CV_32FC3, 1.0f / 255);

        //Copy to GPU global memory (d_albedoMapFilteredFloat3 is a device buffer allocated elsewhere, not shown)
        cutilSafeCall(cudaMemcpy(d_albedoMapFilteredFloat3, Image.data, sizeof(float) * 3 * Image.size().width * Image.size().height, cudaMemcpyHostToDevice));

        //Average on CPU
        for (int x = 0; x < Image.size().width; ++x)
            for (int y = 0; y < Image.size().height; ++y)
            {
                Vec3f intensity = Image.at<Vec3f>(y, x);
                avgCPU += make_float3(intensity.val[0], intensity.val[1], intensity.val[2]);
            }
        avgCPU /= Image.size().width * Image.size().height;


        //Average on GPU
        float3 *d_avg;
        cutilSafeCall(cudaMalloc(&d_avg, sizeof(float3)));
        cutilSafeCall(cudaMemset(d_avg, 0, sizeof(float3)));
        avgImage(d_avg, d_albedoMapFilteredFloat3, Image.size().width, Image.size().height);
        cutilSafeCall(cudaMemcpy(&avgGPU, d_avg, sizeof(float3), cudaMemcpyDeviceToHost));
        avgGPU /= Image.size().width * Image.size().height;

        //Following values are constant across the iterations
        printf("AVG CPU r: %.10f, g: %.10f, b: %.10f\n", avgCPU.x, avgCPU.y, avgCPU.z);

        //Following values are different at every iteration
        printf("AVG GPU r: %.10f, g: %.10f, b: %.10f\n", avgGPU.x, avgGPU.y, avgGPU.z);
    }
}

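So each pair of lines below should match, and the values should not change between iterations. But the pairs do not match, and the GPU result changes from run to run.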

AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325752139, g: 0.6762712002, b: 0.6835504174
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325753927, g: 0.6762660146, b: 0.6835544705
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325772405, g: 0.6762678027, b: 0.6835457087
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325744987, g: 0.6762621403, b: 0.6835452914
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325761080, g: 0.6762756109, b: 0.6835403442
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325756311, g: 0.6762655973, b: 0.6835408211

I have a GTX 960, CUDA 6.5 and Windows 7. Is this a data race? As far as I know, atomicAdd is not reported to have any problems with global memory.

2 answers:

Answer 0: (score: 3)

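The result may depend on the order in which the threads are scheduled. Indeed, depending on the image size and the values in the components, the resulting average may differ slightly from one run to another, although all of the values are correct. If num differs between runs, there is very likely a problem in another part of your code. If num is the same, all results are correct up to the accuracy of the IEEE-754 standard.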

Answer 1: (score: 3)

This is not a data race.

Floating-point addition is commutative:

a + b == b + a

But it is not associative; there exist a, b, c such that:

(a + b) + c != a + (b + c)

A different order of the individual additions (specifically, how they are associated) produces different results.