Question

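I am using OpenCL on Ubuntu 14.04 with a GeForce GTX 550 and driver version 331.38. What puzzles me is the speed of copying from global to local memory. As far as I know, the following code should produce coalesced accesses to global memory: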

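void toLocal(__local float* target, const __global float* source, int count) {
    // Number of passes needed when each pass copies get_local_size(0) elements.
    const int iterations = (count + get_local_size(0) - 1) / get_local_size(0);
    for (int i = 0; i < iterations; i++) {
        // Neighbouring work-items along dimension 0 touch neighbouring elements.
        int idx = i * get_local_size(0) + get_local_id(0);
        if (idx < count)
            target[idx] = source[idx];
    }
}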

In practice, the following code (in which every thread should copy the same floats over and over again) is noticeably faster:

void toLocal(__local float* target, const __global float* source, int count) {
    // Every work-item copies all count elements on its own.
    for (int i = 0; i < count; i++)
        target[i] = source[i];
}

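Both source and target point directly to the beginning of a buffer, so I assume they are properly aligned. The group size is 16 by 16. Trying to use all threads makes the code more complicated but does not affect the speed. The optimal coalescing width would be 128 bytes, or 32 floats, but as far as I know, on compute capability 2.x cards (which the GTX 550 is) the penalty for using only part of that, or even for permuting the elements, should not be that bad. Adding a local memory barrier to the first version makes it slower. Is there anything else I am missing?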
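
The question does not show the variant that uses all threads of the 16 by 16 group. A minimal sketch of what such a version might look like is given here for reference. The name toLocalFlat and the trailing barrier are assumptions, not the asker's code, and the sketch assumes a 2D work-group and that every work-item of the group calls the function.

```c
// Hypothetical variant, not the asker's actual code: spreads the copy over
// every work-item of a 2D work-group instead of only over dimension 0.
void toLocalFlat(__local float* target, const __global float* source, int count) {
    // Flatten the 2D local ID so that consecutive IDs run along dimension 0.
    const int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    const int groupSize = get_local_size(0) * get_local_size(1);

    // Work-item lid copies elements lid, lid + groupSize, lid + 2 * groupSize, ...
    for (int idx = lid; idx < count; idx += groupSize)
        target[idx] = source[idx];

    // Wait until all work-items have finished, so the local data is
    // complete before any work-item reads it.
    barrier(CLK_LOCAL_MEM_FENCE);
}
```

With a 16 by 16 group, the first version above only distinguishes work-items by get_local_id(0), so work-items that differ only in dimension 1 copy the same elements. The flattened ID gives each of the 256 work-items its own stride through the buffer.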

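Edit: Changing the group size to 32 by 32 makes the parallel version roughly as fast as the sequential 16 by 16 version, and makes the sequential version slightly slower. This is still not the speedup I was expecting.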

Slow parallel memory access in OpenCL / nVidia, what am I missing?

0 answers: