The following kernel is supposed to reduce a large chunk of data, and there is one part of it that I just don't understand:
while (global_index < length) .... global_index += get_global_size(0)
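I thought it was smarter to read from global memory that is laid out contiguously. In other words, reading the data at k, k + 1, k + 2 should be faster than reading at k + 1000, k + 2000, k + 3000. Isn't the latter exactly what they are doing when they write global_index += get_global_size(0)?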
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }
    // Perform parallel reduction
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2;
         offset > 0;
         offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}
Answer 0 (score: 1)
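Work items 0, 1, 2, 3, ... will first read buffer indices 0, 1, 2, 3, ... in parallel (which is usually the best case for memory access), then indices 1000, 1001, 1002, 1003, ... in parallel (taking the global size to be 1000, as in your example), and so on. The stride of get_global_size(0) separates successive reads by the same work item; within any single pass of the loop, neighbouring work items still read neighbouring addresses.
Keep in mind that every instruction in the kernel code is executed "in parallel" by all work items.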
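To make the access pattern concrete, here is a rough host-side sketch in plain C that emulates only the first loop of the kernel for one work item at a time. It is not OpenCL code, and the helper names `index_read` and `work_item_min` are made up for this illustration; a real device runs the work items concurrently rather than one after another.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: the buffer index that work item `gid` reads on loop
   pass `pass`. It mirrors the kernel's
   global_index = get_global_id(0) + pass * get_global_size(0). */
size_t index_read(size_t gid, size_t pass, size_t global_size) {
    return gid + pass * global_size;
}

/* Hypothetical helper: sequential emulation of the kernel's first loop for a
   single work item. Returns the minimum of the elements that work item
   visits, or INFINITY if it visits none. */
float work_item_min(const float *buffer, size_t length,
                    size_t gid, size_t global_size) {
    assert(global_size > 0); /* a zero stride would never terminate */
    float accumulator = INFINITY;
    for (size_t i = gid; i < length; i += global_size) {
        float element = buffer[i];
        accumulator = (accumulator < element) ? accumulator : element;
    }
    return accumulator;
}
```

With an assumed global size of 4, pass 0 touches indices 0, 1, 2, 3 across work items 0 to 3, and pass 1 touches 4, 5, 6, 7. The reads that happen at the same time are adjacent; only the reads made by one work item over time are a stride apart.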