Question

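A large array can be reduced by calling __reduce() multiple times.

However, the following code uses only two stages and is documented here:

I cannot follow the algorithm behind this two-stage reduction. Can someone give a simpler explanation?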

__kernel
void reduce(__global float* buffer,
        __local float* scratch,
        __const int length,
        __global float* result) {

    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Perform parallel reduction
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int offset = get_local_size(0) / 2; offset > 0; offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}

It can also be implemented very well in CUDA.

Answer 1

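You create N threads. The first thread looks at the values at positions 0, N, 2*N, ... The second thread looks at the values at positions 1, N+1, 2*N+1, ... That is the first loop. It reduces length values to N values.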
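To make the strided access pattern concrete, here is a minimal CPU sketch in plain C of what a single thread does in that first loop. The function name strided_min and its parameters are made up for illustration; global_id and global_size stand in for get_global_id(0) and get_global_size(0) (the N above).

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: simulates the kernel's first loop for ONE
   work-item on the CPU. On the device, all N work-items run this
   at the same time, each with its own global_id. */
float strided_min(const float *buffer, size_t length,
                  size_t global_id, size_t global_size)
{
    assert(global_size > 0 && global_id < global_size);
    float accumulator = INFINITY;  /* identity element for min */
    /* Visit global_id, global_id + N, global_id + 2*N, ... */
    for (size_t i = global_id; i < length; i += global_size) {
        float element = buffer[i];
        accumulator = (accumulator < element) ? accumulator : element;
    }
    return accumulator;
}
```

With 10 input values and N = 4, thread 0 reads indices 0, 4, 8, thread 1 reads 1, 5, 9, and so on, so after this loop there are exactly 4 candidate minima left.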

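Each thread then saves its minimum in shared/local memory. Next comes a synchronization instruction (barrier(CLK_LOCAL_MEM_FENCE)), followed by a standard reduction in shared/local memory. When that is done, the thread with local ID 0 saves its group's result in the output array.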
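The "standard reduction" in local memory can be sketched on the CPU as follows. This is only a sequential simulation with a made-up name (tree_min): the inner loop plays the role of the work-items with local_index < offset, which on the device run in parallel between two barriers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: simulates the in-group tree reduction.
   scratch holds one value per work-item of a single work-group. */
float tree_min(float *scratch, size_t local_size)
{
    /* Like the kernel's halving loop, this assumes a power-of-two
       work-group size; otherwise some elements would be skipped. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);
    for (size_t offset = local_size / 2; offset > 0; offset /= 2) {
        /* Each "work-item" below offset folds in its partner. */
        for (size_t local_index = 0; local_index < offset; ++local_index) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        /* The kernel calls barrier(CLK_LOCAL_MEM_FENCE) at this point. */
    }
    return scratch[0];  /* the value work-item 0 writes to result[] */
}
```

Every pass halves the number of live values, so a work-group of 256 threads needs 8 passes to arrive at a single minimum in scratch[0].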

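All in all, this reduces length values to N / get_local_size(0) values, one per work-group. After this code has run, you still need one final pass over those values. The kernel does most of the work, though. For example, with length ~ 10^8, N = 2^16 and get_local_size(0) = 256 = 2^8, there are 2^16 / 2^8 = 2^8 work-groups, so this code reduces 10^8 elements to 256 elements.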
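The arithmetic and the last pass can be sketched like this. Running the final pass on the host after reading the result buffer back is an assumption made for the sketch; the names num_groups and final_min are hypothetical and not part of OpenCL.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: number of partial results the kernel writes,
   one per work-group. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: second stage, a plain sequential min over
   the per-group minima stored in the result array. */
float final_min(const float *result, size_t groups)
{
    float m = INFINITY;
    for (size_t g = 0; g < groups; ++g)
        m = (result[g] < m) ? result[g] : m;
    return m;
}
```

For N = 65536 and a work-group size of 256, num_groups gives 256, so the final pass only has to look at 256 floats, which is cheap compared with the 10^8 elements handled by the kernel.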

Which parts do you not understand?

OpenCL / CUDA: Two-stage reduction algorithm
