Question

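I've been playing around with OpenCL lately, and I can write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.

For example, say I wanted to convert Apple's OpenCL Hello World example kernel into something that uses local memory. How would you do it? Here is the original kernel source: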

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

If this example can't easily be converted into something that shows how to use local memory, any other simple example will do.

Answer 1

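Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.

With your squaring kernel, you could stage the data in an intermediate local buffer. Remember to pass in the additional parameter: on the host, a __local pointer argument is set with clSetKernelArg by giving the buffer size in bytes and a NULL value pointer.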

__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
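The index relationship behind computing "one chunk of output at a time" can be checked on the host without a device. What follows is a minimal sketch in plain C, not OpenCL code: it assumes a 1D range with a zero global offset and emulates the work-items with nested loops. The function name square_tiled and the TILE constant are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical work-group size, i.e. get_local_size(0) */

/* Host-side emulation of the staged square kernel above. */
void square_tiled(const float *input, float *output, size_t count)
{
    assert(input != NULL && output != NULL);

    /* The global size is rounded up to a multiple of TILE, which is why
       the kernel needs the (gtid < count) guard. */
    size_t num_groups = (count + TILE - 1) / TILE;

    for (size_t group = 0; group < num_groups; ++group) {   /* get_group_id(0) */
        float temp[TILE];  /* stands in for the __local buffer of one work-group */

        for (size_t ltid = 0; ltid < TILE; ++ltid) {        /* get_local_id(0) */
            size_t gtid = group * TILE + ltid;              /* get_global_id(0) */
            if (gtid < count) {
                temp[ltid] = input[gtid];
                output[gtid] = temp[ltid] * temp[ltid];
            }
        }
    }
}
```

Each pass of the outer loop corresponds to one work-group producing one chunk of TILE outputs, and temp is indexed only by the local id, so it never needs more than get_local_size(0) elements.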

Answer 2

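If the size of the local memory is constant, there is another possibility. Instead of using a pointer in the kernel's parameter list, you can declare the local buffer inside the kernel with the __local qualifier: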

__local float localBuffer[1024];

This saves host code, since one fewer clSetKernelArg call is needed.

Answer 3

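In OpenCL, local memory is meant for sharing data among all work-items in a work-group. It usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data written by other work-items). Barriers are costly in hardware. Keep in mind that local memory pays off when the same data is read or written repeatedly. Bank conflicts should be avoided as much as possible.

If you are not careful with local memory, you may end up with worse performance than if you had simply used global memory.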
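To make the barrier requirement concrete, here is a minimal host-side sketch in plain C, not OpenCL code. It emulates one work-group at a time with two loops; the gap between the loops is where a kernel would call barrier(CLK_LOCAL_MEM_FENCE). The function name reverse_within_groups and the TILE constant are hypothetical, and the sketch assumes count is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical work-group size, i.e. get_local_size(0) */

/* Host-side emulation: every work-item reads a slot written by another
   work-item of the same group, so the kernel version needs a barrier. */
void reverse_within_groups(const float *input, float *output, size_t count)
{
    assert(count % TILE == 0);  /* assumption: no partial work-group */

    for (size_t group = 0; group < count / TILE; ++group) {
        float temp[TILE];  /* stands in for the __local buffer */

        /* Phase 1: each work-item stores its own element. */
        for (size_t ltid = 0; ltid < TILE; ++ltid)
            temp[ltid] = input[group * TILE + ltid];

        /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) goes here, so that
           all phase 1 writes are visible before any phase 2 read. */

        /* Phase 2: each work-item reads the slot written by its mirror. */
        for (size_t ltid = 0; ltid < TILE; ++ltid)
            output[group * TILE + ltid] = temp[TILE - 1 - ltid];
    }
}
```

On a device the two phases run concurrently across work-items, so without the barrier a work-item could read temp[TILE - 1 - ltid] before its neighbour has written it.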

How do I use local memory in OpenCL?
