Question

I have been reading the following article: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/

The kernel below is supposed to reduce a large chunk of data, and there is one part of it that I just don't understand:

while (global_index < length) ....  global_index += get_global_size(0)

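I believed it is smarter to read data that is laid out contiguously in global memory, meaning that reading at k, k + 1, k + 2 is faster than reading at k + 1000, k + 2000, k + 3000. Isn't the latter exactly what they are doing with global_index += get_global_size(0)?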

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}

Answer 1

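Work-items 0, 1, 2, 3, ... first read buffer indices 0, 1, 2, 3, ... in parallel (which is generally the best case for memory access). Then, taking a global size of 1000 as in your example, they read 1000, 1001, 1002, 1003, ... in parallel, and so on. The stride of get_global_size(0) separates successive reads by the same work-item, not reads made by neighbouring work-items at the same time.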

Remember that every instruction in the kernel code is executed "in parallel" by all work-items.
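
To make the access pattern concrete, here is a small host-side simulation in plain Python (not OpenCL). It only replays the index arithmetic of the while loop; the helper names indices_read and reads_per_step are made up for this illustration, and the global size of 4 and length of 10 are arbitrary small values chosen so the output is easy to read. It also assumes, as a simplification, that all work-items advance through the loop in lockstep, which real hardware only guarantees within a wavefront/warp.

```python
def indices_read(global_id, global_size, length):
    """Indices one work-item visits: start at its global id,
    then advance by the global size until past the end of the buffer."""
    return list(range(global_id, length, global_size))


def reads_per_step(global_size, length):
    """Group the same reads by loop iteration instead of by work-item.
    Entry s lists what all work-items read together in iteration s
    (idealised lockstep execution, an assumption of this sketch)."""
    per_item = [indices_read(g, global_size, length) for g in range(global_size)]
    n_steps = max((len(p) for p in per_item), default=0)
    return [[p[s] for p in per_item if s < len(p)] for s in range(n_steps)]


# Hypothetical small launch: 4 work-items reducing a buffer of 10 elements.
for g in range(4):
    print(f"work-item {g} reads {indices_read(g, 4, 10)}")
for step, idx in enumerate(reads_per_step(4, 10)):
    print(f"iteration {step}: neighbouring work-items read {idx}")
```

Looking at one work-item in isolation shows the strided sequence you were worried about (k, k + 4, k + 8). Looking at one iteration across all work-items shows a contiguous block, and that second view is the one that matters for how the memory system combines the requests.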
