Question

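I want to use OpenCL to count the total number of non-zero points in an image.

Since this is an accumulation task, I used atom_inc.

The kernel code is shown here.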

__kernel void points_count(__global unsigned char* image_data, __global int* total_number, int image_width)
{
    size_t gidx = get_global_id(0);
    size_t gidy = get_global_id(1);
    if(0!=*(image_data+gidy*image_width+gidx))
    {
        atom_inc(total_number);
    }
}

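My question is whether using atom_inc like this is wasteful.

Every time we hit a non-zero point, we have to wait on atom_inc.

My idea is that we could split the rows into many groups, count the non-zero points in each group separately, and add the counts together at the end.

Suppose we do something like this: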

__kernel void points_count(__global unsigned char* image_data, __global int* total_number_array, int image_width)
{
    size_t gidx = get_global_id(0);
    size_t gidy = get_global_id(1);
    if(0!=*(image_data+gidy*image_width+gidx))
    {
        int stepy=gidy%10;
        atom_inc(total_number_array+stepy);
    }    
}

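This splits the whole problem into more groups. We can then add up the numbers in total_number_array one by one at the end.

In theory, would this give a good performance improvement?

Does anyone have suggestions for this summation problem?

Thanks!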

Answer 1

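As mentioned in the comments, this is a reduction problem.

The idea is to keep separate counts and then combine them at the end.

Consider using local memory to store the values.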

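Declare a local buffer for each work-group to use.
Use local_id as the index into this buffer to keep track of the number of occurrences.
Sum these values at the end of execution.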
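
The steps above can be sketched as follows. This is a plain C emulation that runs on the CPU, not device code: `GROUP_SIZE`, `count_group`, and `count_nonzero` are hypothetical names, the array `local_counts` stands in for a `__local` buffer, and the loops stand in for work-items that would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-group size, kept tiny so the emulation is easy to trace. */
#define GROUP_SIZE 4

/* Emulates one work-group: each work-item writes a 0/1 flag into its own
   slot of a local buffer, then the slots are summed. */
static int count_group(const unsigned char *image, size_t n, size_t group_id)
{
    int local_counts[GROUP_SIZE]; /* stands in for a __local int buffer */
    int group_total = 0;
    size_t lid;

    for (lid = 0; lid < GROUP_SIZE; ++lid) {       /* lid ~ get_local_id(0) */
        size_t gid = group_id * GROUP_SIZE + lid;  /* gid ~ get_global_id(0) */
        local_counts[lid] = (gid < n && image[gid] != 0) ? 1 : 0;
    }
    /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) at this point. */
    for (lid = 0; lid < GROUP_SIZE; ++lid)
        group_total += local_counts[lid];
    return group_total;
}

/* Adds up the per-group totals; on a device this is the only step that
   would touch shared global memory. */
int count_nonzero(const unsigned char *image, size_t n)
{
    size_t groups = (n + GROUP_SIZE - 1) / GROUP_SIZE;
    size_t g;
    int total = 0;

    assert(image != NULL || n == 0);
    for (g = 0; g < groups; ++g)
        total += count_group(image, n, g);
    return total;
}
```

In an actual kernel, only one work-item per group would add its group's total to the global counter, so the number of atomic operations would drop from one per non-zero pixel to one per work-group.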

Answer 2

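A very good introduction to reduction problems in OpenCL is available here: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/

A reduction kernel might look like this (taken from the link above; note that this example computes a minimum, so counting would need a sum instead):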

__kernel
void reduce(
            __global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  int local_index = get_local_id(0);
  // Load data into local memory
  if (global_index < length) {
    scratch[local_index] = buffer[global_index];
  } else {
    // Infinity is the identity element for the min operation
    scratch[local_index] = INFINITY;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset >>= 1) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}

See the linked article for further explanation.
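
For the counting problem in the question, the same tree pattern would use addition instead of min, with 0 as the identity element for padding. Here is a minimal sketch, written as plain C on the CPU rather than device code. `TREE_GROUP_SIZE`, `reduce_sum_group`, and `count_nonzero_tree` are hypothetical names, and the inner loop stands in for work-items running in parallel between barriers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-group size; the halving loop assumes a power of two. */
#define TREE_GROUP_SIZE 8

/* Emulates the tree reduction of one work-group, with sum in place of min.
   scratch plays the role of the __local buffer and is overwritten. */
int reduce_sum_group(int *scratch, size_t group_size)
{
    size_t offset, local_index;

    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);
    for (offset = group_size / 2; offset > 0; offset >>= 1) {
        /* On a device each iteration below is one work-item, and a
           barrier(CLK_LOCAL_MEM_FENCE) follows every pass. */
        for (local_index = 0; local_index < offset; ++local_index)
            scratch[local_index] += scratch[local_index + offset];
    }
    return scratch[0];
}

/* Loads 0/1 flags group by group, reduces each group, and adds the results. */
int count_nonzero_tree(const unsigned char *image, size_t length)
{
    size_t groups = (length + TREE_GROUP_SIZE - 1) / TREE_GROUP_SIZE;
    size_t g, lid;
    int total = 0;

    for (g = 0; g < groups; ++g) {
        int scratch[TREE_GROUP_SIZE];
        for (lid = 0; lid < TREE_GROUP_SIZE; ++lid) {
            size_t gid = g * TREE_GROUP_SIZE + lid;
            /* 0 is the identity element for addition, used as padding. */
            scratch[lid] = (gid < length && image[gid] != 0) ? 1 : 0;
        }
        total += reduce_sum_group(scratch, TREE_GROUP_SIZE);
    }
    return total;
}
```

On a device, each group would write its result to `result[get_group_id(0)]` as in the kernel above, and the per-group results would then be summed on the host or in a second pass.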

Any good ideas for splitting up OpenCL atom_inc?
