Question

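So I have a cube of images, 512x512x512. I want to sum the images pixel-wise and save the result into a final image. So if every pixel has the value 1, the final image will be all 512. I am having trouble understanding the indexing needed to do this in CUDA. I figure one thread's job is to sum all 512 values for its pixel, so the total number of threads would be 512x512. So I plan to do it with 512 blocks of 512 threads each. From there, I can't work out the indexing for summing over the depth. Any help would be greatly appreciated.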

Answer 1

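One way to approach this problem is to picture the cube as a stack of Z slices. The X and Y coordinates refer to the width and height of the image, and the Z coordinate selects a slice along the Z dimension. Each thread iterates over the Z coordinate to accumulate the values.

With this in mind, configure the kernel to launch blocks of 16x16 threads and a grid with enough blocks to cover the image's width and height (I am assuming a grayscale image with 1 byte per pixel):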

#define THREADS 16
// kernel configuration
dim3 dimBlock = dim3 ( THREADS, THREADS, 1 );
dim3 dimGrid  = dim3 ( WIDTH / THREADS, HEIGHT / THREADS );
// call the kernel
kernel<<<dimGrid, dimBlock>>>(i_data, o_Data, WIDTH, HEIGHT, DEPTH);
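The grid size above uses integer division, so it covers the whole image only when WIDTH and HEIGHT are multiples of THREADS (true for 512 and 16). A minimal sketch of the usual round-up, assuming a hypothetical helper named blocksFor that is not part of the answer. With a rounded-up grid, the kernel must also return early for any thread whose x or y falls outside the image.

```cpp
// Hypothetical helper (an assumption, not from the answer): number of blocks
// needed to cover `extent` pixels with `threads` threads per block.
// Rounds up so that a partial block at the edge is still launched.
constexpr int blocksFor(int extent, int threads)
{
    return (extent + threads - 1) / threads;
}
```

For example, `dim3 dimGrid(blocksFor(WIDTH, THREADS), blocksFor(HEIGHT, THREADS));` gives 32x32 blocks for a 512x512 image and still 32x32 for a 500x500 one, where the last block in each direction is only partly used.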

If it is clear to you how to index a 2D array, the loop over the Z dimension should be clear as well:

__global__ void kernel(const unsigned char* i_data, int* o_data, int WIDTH, int HEIGHT, int DEPTH)
{
  // map from threadIdx/blockIdx to the pixel position
  int x = threadIdx.x + blockIdx.x * blockDim.x;
  int y = threadIdx.y + blockIdx.y * blockDim.y;
  // skip threads that fall outside the image
  if (x >= WIDTH || y >= HEIGHT) return;
  // calculate the global index of the pixel in the image array;
  // this global index points into the first slice of the cube
  int idx = x + y * WIDTH;

  // partial result
  int r = 0;

  // iterate in the Z dimension
  for (int z = 0; z < DEPTH; ++z)
  {
    // WIDTH * HEIGHT is the offset of one slice
    int idx_z = z * WIDTH * HEIGHT + idx;
    r += i_data[ idx_z ];
  }
  // o_data is a 2D array, so you can use the global index idx;
  // it holds ints because the sum (512 in the question's example)
  // does not fit in an unsigned char
  o_data[ idx ] = r;
}

This is a naive implementation. To maximize memory throughput, the data should be properly aligned.
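To check the indexing, it can help to compare the GPU result against a plain CPU loop over the same memory layout. The sketch below assumes the cube is stored slice after slice, each slice row by row, as in the kernel. The function name sumAlongZ is hypothetical. The result is held in int because 512 slices of 1-byte pixels can sum to as much as 512 * 255 = 130560.

```cpp
#include <cstddef>
#include <vector>

// CPU reference (hypothetical helper, not part of the answer): sums a
// width x height x depth cube along Z using the kernel's layout.
std::vector<int> sumAlongZ(const std::vector<unsigned char>& cube,
                           int width, int height, int depth)
{
    const std::size_t sliceSize = static_cast<std::size_t>(width) * height;
    std::vector<int> result(sliceSize, 0);
    for (int z = 0; z < depth; ++z) {
        for (std::size_t idx = 0; idx < sliceSize; ++idx) {
            // z * sliceSize is the offset of slice z; idx = x + y * width
            result[idx] += cube[z * sliceSize + idx];
        }
    }
    return result;
}
```

Comparing this vector element by element with the output copied back from the device is a quick way to confirm that the kernel reads the right pixel in every slice.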

Answer 2

This can be done easily with the ArrayFire GPU library (free). In ArrayFire, you can build a 3D array as shown below.

Two methods:

// Method 1:
array data = rand(x, y, z);
// Just reshaping the array, this is a no-op
data = newdims(data, x*y, z, 1);

// Sum of pixels
array res = sum(data);

// Method 2 (an alternative to Method 1, not meant to share its scope):
// Use ArrayFire "GFOR"
array data = rand(x, y, z);
array res  = zeros(z, 1);
gfor (array i, z) {
    res(i) = sum(data(span, span, i));
}
