Question

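I am trying to implement a CUDA version of an algorithm that converts an image to grayscale. It works, but a few pixels come out wrong. I found that one of the rounding operations I do on the GPU gives slightly different results from the same operation on the CPU. Is there a way to make my GPU behave exactly like my CPU? I have already tried some nvcc compilation flags (ftz=false/true and prec-div=false/true), but to no avail. I have also tried doing all operations in double, since doubles should be more precise, but that did not help either. This is my CUDA kernel, followed by the equivalent sequential version: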

CUDA

__global__ void darkenImage(const unsigned char * inputImage,
    unsigned char * outputImage, const int width, const int height, int iteration){

  int x = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads)))%width;
  int y = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads)))/width;

  if(x < width && y < height){
    float grayPix = 0.0f;
    float r = static_cast< float >(inputImage[(y * width) + x]);
    float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
    float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);

    grayPix = fma(0.3f,r,fma(0.59f,g,(0.11f * b)));
    grayPix = fma(grayPix,0.6f,0.5f);


    outputImage[(y * width) + x] = static_cast< unsigned char >(grayPix);
  }
}

Sequential

for(int x=0;x<width;x++){
    for(int y=0;y<height;y++){
      float grayPix = 0.0f;
      float r = static_cast< float >(inputImage[(y * width) + x]);
      float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
      float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);

      grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
      grayPix = (grayPix * 0.6f) + 0.5f;

      outputImage2[(y * width) + x] = static_cast< unsigned char >(grayPix);
    }
  }
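
The two listings differ in more than one way: the kernel fuses each multiply-add with fma (one rounding per operation instead of two), and it sums the three terms in a different order from the sequential loop. The host-side C++ sketch below mirrors both arithmetic paths so they can be compared pixel by pixel without a GPU. The names grayFused, grayUnfused, ScanResult and scanPixels are hypothetical helpers introduced here, and the volatile temporaries are an assumption made to keep the host compiler from contracting the unfused path into FMA instructions.

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical helper: same arithmetic as the CUDA kernel.
// Each fma rounds once, and the terms are combined as 0.3r + (0.59g + 0.11b).
unsigned char grayFused(unsigned char rIn, unsigned char gIn, unsigned char bIn) {
    float r = static_cast<float>(rIn);
    float g = static_cast<float>(gIn);
    float b = static_cast<float>(bIn);

    float grayPix = std::fma(0.3f, r, std::fma(0.59f, g, 0.11f * b));
    grayPix = std::fma(grayPix, 0.6f, 0.5f);
    return static_cast<unsigned char>(grayPix);
}

// Hypothetical helper: same arithmetic as the sequential loop.
// Every multiply and add is rounded separately, combined as (0.3r + 0.59g) + 0.11b.
// The volatile temporaries force each intermediate to be stored as a float
// (assumption: this is enough to stop the compiler from emitting FMAs here).
unsigned char grayUnfused(unsigned char rIn, unsigned char gIn, unsigned char bIn) {
    float r = static_cast<float>(rIn);
    float g = static_cast<float>(gIn);
    float b = static_cast<float>(bIn);

    volatile float pr = 0.3f * r;
    volatile float pg = 0.59f * g;
    volatile float pb = 0.11f * b;
    volatile float sumRG = pr + pg;
    volatile float sum = sumRG + pb;
    volatile float scaled = sum * 0.6f;
    float grayPix = scaled + 0.5f;
    return static_cast<unsigned char>(grayPix);
}

// Result of comparing the two paths over a grid of RGB inputs.
struct ScanResult {
    long mismatches;  // number of inputs where the two outputs differ
    int maxDiff;      // largest absolute difference between the two outputs
};

// Hypothetical helper: walks r, g, b over 0..255 with the given stride
// and records how often, and by how much, the two paths disagree.
ScanResult scanPixels(int step) {
    if (step < 1) step = 1;
    ScanResult result = {0, 0};
    for (int r = 0; r < 256; r += step) {
        for (int g = 0; g < 256; g += step) {
            for (int b = 0; b < 256; b += step) {
                int fused = grayFused(static_cast<unsigned char>(r),
                                      static_cast<unsigned char>(g),
                                      static_cast<unsigned char>(b));
                int unfused = grayUnfused(static_cast<unsigned char>(r),
                                          static_cast<unsigned char>(g),
                                          static_cast<unsigned char>(b));
                int diff = std::abs(fused - unfused);
                if (diff != 0) ++result.mismatches;
                if (diff > result.maxDiff) result.maxDiff = diff;
            }
        }
    }
    return result;
}
```

Calling scanPixels(1) covers every 8-bit RGB combination. A nonzero mismatches count would suggest that the fused arithmetic and the different summation order can, by themselves, move a pixel by one level. If so, one option worth trying is to write the kernel with the CUDA intrinsics __fmul_rn and __fadd_rn, which are documented as never being merged into an FMA. Whether that reproduces the CPU output exactly also depends on how the host compiler evaluates float expressions.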

All of my runs were done on an Nvidia GTX 560-Ti or a GTX 480, both of which should have compute capability 2.0.

Regards, Linus

Rounding differences on GPU and CPU

0 answers: