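I am averaging the intensity of an image as a simple test case for a larger problem. However, every time I run it I get slightly different results. In contrast, if I run the same algorithm sequentially on the CPU, the result is the same every time. Here is the GPU code: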
//util.cu
__global__ void avgImageDevice(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height)
{
    // One thread per pixel
    const unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    // Skip threads that fall outside the image
    if (x >= width || y >= height) return;
    // Accumulate each channel into one global sum; the order of these adds is not fixed
    atomicAdd(&avg->x, d_colorImageRGB[y*width + x].x);
    atomicAdd(&avg->y, d_colorImageRGB[y*width + x].y);
    atomicAdd(&avg->z, d_colorImageRGB[y*width + x].z);
}
extern "C" void avgImage(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height)
{
    const int T_PER_BLOCK = 16;
    // 16x16 threads per block, and enough blocks to cover the whole image
    const dim3 blockSize(T_PER_BLOCK, T_PER_BLOCK);
    const dim3 gridSize((width + T_PER_BLOCK - 1) / T_PER_BLOCK, (height + T_PER_BLOCK - 1) / T_PER_BLOCK);
    avgImageDevice<<<gridSize, blockSize>>>(avg, d_colorImageRGB, width, height);
}
The CPU implementation is as follows:
//main.cpp
#include <vector_types.h>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp> // imread
#include <cuda_runtime.h>
#include <cstdio>
#include <string>
// Note: cutilSafeCall and the float3 operators (+=, /=) are not defined in this
// snippet; they are assumed to come from helper headers that are not shown.
using namespace cv;
extern "C" void avgImage(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height);
int main()
{
    // d_albedoMapFilteredFloat3 is a float3 device buffer; its declaration and
    // allocation are not shown in the question.
    for (int k = 0; k < 100; ++k)
    {
        //Initialization
        Mat Image;
        float3 avgCPU = make_float3(0, 0, 0);
        float3 avgGPU = make_float3(0, 0, 0);
        std::string filenameImage("/foo.jpg");
        Image = imread(filenameImage, -1);
        Image.convertTo(Image, CV_32FC3, 1.0f / 255);
        //Copy to GPU global memory
        cutilSafeCall(cudaMemcpy(d_albedoMapFilteredFloat3, Image.data, sizeof(float) * 3 * Image.size().width * Image.size().height, cudaMemcpyHostToDevice));
        //Average on CPU
        for (int x = 0; x < Image.size().width; ++x)
            for (int y = 0; y < Image.size().height; ++y)
            {
                Vec3f intensity = Image.at<Vec3f>(y, x);
                avgCPU += make_float3(intensity.val[0], intensity.val[1], intensity.val[2]);
            }
        avgCPU /= Image.size().width * Image.size().height;
        //Average on GPU
        float3 *d_avg;
        cutilSafeCall(cudaMalloc(&d_avg, sizeof(float3)));
        cutilSafeCall(cudaMemset(d_avg, 0, sizeof(float3)));
        avgImage(d_avg, d_albedoMapFilteredFloat3, Image.size().width, Image.size().height);
        cutilSafeCall(cudaMemcpy(&avgGPU, d_avg, sizeof(float3), cudaMemcpyDeviceToHost));
        cutilSafeCall(cudaFree(d_avg)); //Release the per-iteration accumulator
        avgGPU /= Image.size().width * Image.size().height;
        //Following values are constant across the iterations
        printf("AVG CPU r: %.10f, g: %.10f, b: %.10f\n", avgCPU.x, avgCPU.y, avgCPU.z);
        //Following values are different at every iteration
        printf("AVG GPU r: %.10f, g: %.10f, b: %.10f\n", avgGPU.x, avgGPU.y, avgGPU.z);
    }
    return 0;
}
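I would therefore expect each pair of lines below to match and to stay the same from run to run. But they do not match, and the GPU results change on every iteration.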
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325752139, g: 0.6762712002, b: 0.6835504174
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325753927, g: 0.6762660146, b: 0.6835544705
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325772405, g: 0.6762678027, b: 0.6835457087
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325744987, g: 0.6762621403, b: 0.6835452914
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325761080, g: 0.6762756109, b: 0.6835403442
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325756311, g: 0.6762655973, b: 0.6835408211
I have a GTX 960, CUDA 6.5 and Windows 7. Is this a data race problem? As far as I know, atomicAdd has not been reported to have any problems with global memory.
Answer 0 (score: 3)
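The result may depend on the order in which the threads are scheduled. Depending on the image size and the values in each channel, the resulting average may differ slightly from one run to the next, even though every one of those values is correct. If num differs between runs, there is most likely a problem in another part of your code. If num is the same, then all of your results are correct, up to the precision of the IEEE-754 standard.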
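The order dependence can be reproduced on the CPU alone. The following is a minimal sketch, not the asker's code: it adds the same one million values in (0, 1], used here as stand-ins for normalized pixel intensities, once from largest to smallest and once from smallest to largest. The names sumForward, sumReverse and relativeDifference are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: adds 1/1 + 1/2 + ... + 1/n in float, largest term first.
float sumForward(int n) {
    assert(n > 0);
    float sum = 0.0f;
    for (int i = 1; i <= n; ++i) {
        sum += 1.0f / static_cast<float>(i);
    }
    return sum;
}

// Hypothetical helper: adds exactly the same terms, smallest term first.
float sumReverse(int n) {
    assert(n > 0);
    float sum = 0.0f;
    for (int i = n; i >= 1; --i) {
        sum += 1.0f / static_cast<float>(i);
    }
    return sum;
}

// Hypothetical helper: relative difference between two results,
// useful for comparing a CPU sum and a GPU sum with a tolerance.
float relativeDifference(float a, float b) {
    const float scale = std::fmax(std::fabs(a), std::fabs(b));
    return scale == 0.0f ? 0.0f : std::fabs(a - b) / scale;
}
```

Calling sumForward(1000000) and sumReverse(1000000) gives two different float values although both add the same terms, and relativeDifference of the two stays small. This suggests comparing the CPU and GPU averages with a relative tolerance rather than expecting bit-identical output.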
Answer 1 (score: 3)
This is not a data race.
Floating-point addition is commutative:
a + b == b + a
But it is not associative; there exist a, b, c such that:
(a + b) + c != a + (b + c)
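Performing the individual additions in a different order (in particular, grouping them differently) produces different results. Since the order in which the atomicAdd calls execute varies from run to run, so does the rounded sum.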
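A concrete triple makes the non-associativity visible. This is a small sketch under the assumption of IEEE-754 single precision; groupLeft and groupRight are hypothetical names used only here, and the volatile intermediates are there to force each partial sum to be rounded to float.

```cpp
// Hypothetical helper: evaluates (a + b) + c with a float intermediate.
float groupLeft(float a, float b, float c) {
    volatile float ab = a + b;  // rounded to float before the second add
    return ab + c;
}

// Hypothetical helper: evaluates a + (b + c) with a float intermediate.
float groupRight(float a, float b, float c) {
    volatile float bc = b + c;  // rounded to float before the second add
    return a + bc;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, groupLeft returns 1 while groupRight returns 0. Adjacent floats near 1e8 are 8 apart, so -1e8f + 1.0f rounds back to -1e8f and the 1 is lost before a is added.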