How does a barrier work in an OpenCL kernel?

Asked: 2014-04-08 12:29:07

Tags: opencl

Kernel code:

#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void calculate (__global double* in)
{
    int idx = get_global_id(0); // statement 1
    printf("started for %d workitem\n", idx); // statement 2
    in[idx] = idx + 100; // statement 3
    printf("value changed to %lf in %d workitem\n", in[idx], idx); // statement 4
    barrier(CLK_GLOBAL_MEM_FENCE); // statement 5
    printf("completed for %d workitem\n", idx); // statement 6
}

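I invoke the kernel with clEnqueueNDRangeKernel, passing as its argument an array of 5 doubles, each initialized to 0.0.

I call the kernel with a global_work_size of 5, so each element of the array is handled by its own work-item.

My theoretical understanding of barriers is this: to synchronize the work-items in a work-group, OpenCL provides the barrier function. It forces a work-item to wait until every other work-item in the group has reached the barrier. By creating a barrier, you can make sure that every work-item has reached the same point in its processing. This is crucial when work-items need to finish computing an intermediate result that will be used in later computations.

So I was expecting output like this: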

started for 0 workitem
started for 1 workitem
value changed to 100.000000 in 0 workitem
value changed to 101.000000 in 1 workitem
started for 3 workitem
value changed to 103.000000 in 3 workitem
started for 2 workitem
value changed to 102.000000 in 2 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem

completed for 3 workitem
completed for 0 workitem
completed for 1 workitem
completed for 2 workitem
completed for 4 workitem

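The "completed" statements would all come together at the end, because the barrier holds the other work-items back until every one of them has reached that point.

However, the result I get is: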

started for 0 workitem
value changed to 100.000000 in 0 workitem
completed for 0 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem
completed for 4 workitem
started for 1 workitem
started for 2 workitem
started for 3 workitem
value changed to 101.000000 in 1 workitem
value changed to 103.000000 in 3 workitem
completed for 3 workitem
value changed to 102.000000 in 2 workitem
completed for 2 workitem
completed for 1 workitem

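Am I missing something in the logic? How, then, does a barrier work in an OpenCL kernel?

I added a further check to the kernel, so that the updated values are cross-checked after the barrier instead of relying on print statements.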

#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void calculate (__global double* in)
{
    int idx = get_global_id(0);
    in[idx] = idx + 100;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (idx == 0) {
        in[0] = in[4];
        in[1] = in[3];
        in[2] = in[2];
        in[3] = in[1];
        in[4] = in[0];
    }
}

I expected the array afterwards to be:

after arr[0] = 104.000000
after arr[1] = 103.000000
after arr[2] = 102.000000
after arr[3] = 101.000000
after arr[4] = 100.000000

But the result I got was:

after arr[0] = 0.000000
after arr[1] = 101.000000
after arr[2] = 102.000000
after arr[3] = 103.000000
after arr[4] = 104.000000

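2 answers:

Answer 0 (score: 2)

Yes, you are missing the fact that adding printf() invalidates any conclusion about ordering.

In fact, OpenCL states that the use of printf() is implementation-defined, and: "In the case that printf is executed from multiple work-items concurrently, there is no guarantee of ordering with respect to written data." Simple logic tells you that the output will be flushed in order, one work-item (WI) at a time, because the easiest approach is to serialize the flush after the parallel execution has filled a number of buffers (one per WI that calls printf).

The work-items do execute in the order you expect, but the output is flushed to stdout after the kernel has finished, and that flush does not follow the original order.

Answer 1 (score: 2)

The code looks fine; I suspect the local work-group size. If you do not specify a local work-group size, the OpenCL implementation picks the one it considers best based on some checks (often ONE).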

Check your clEnqueueNDRangeKernel call against the one below:
size_t global_item_size = 5; // total number of work-items
size_t local_item_size = 5;  // number of work-items per work-group
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

Note: this answer assumes that you either did not specify a local work-group size or did not set it to match your requirements.
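
Once the local size equals the global size, as in the call above, the barrier can be checked without printf and without overwriting the array in place. The kernel below is only a sketch of that idea: the kernel name reverse_check and the second buffer out are hypothetical and not part of the question, and it assumes that all work-items run in a single work-group and that the device supports cl_khr_fp64.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Hypothetical check kernel: "out" is an extra buffer the host must allocate.
// It is only valid when every work-item belongs to the same work-group.
__kernel void reverse_check(__global double *in, __global double *out)
{
    int idx = get_global_id(0);
    int n = get_global_size(0);

    in[idx] = idx + 100;

    // Every work-item of the work-group waits here; the global fence makes
    // the writes above visible to the other work-items of the group.
    barrier(CLK_GLOBAL_MEM_FENCE);

    // Read a value written by a different work-item.
    out[idx] = in[n - 1 - idx];
}
```

With 5 work-items in one work-group, out should read back on the host as 104, 103, 102, 101, 100. Writing to a separate buffer avoids the in-place overwrites of the question's second kernel, where in[1] has already been replaced by the time in[3] = in[1] runs.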

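More about work-groups:

A barrier only blocks the threads (work-items) within one work-group. Because you did not specify a work-group size, it is taken to be one, so you end up with 5 work-groups of a single thread each, and the barrier has nothing to synchronize.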
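
The arithmetic behind that statement can be sketched in plain host-side C. The helper names below are hypothetical (they are not OpenCL API calls), and the sketch assumes a one-dimensional range with a global offset of 0 and a global size that is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in one dimension.
   Assumes the global size is a multiple of the local size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Index of the work-group a global id falls into (global offset assumed 0). */
size_t group_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id / local_size;
}

/* Nonzero when two work-items share a work-group, which is the only
   case in which barrier() makes one wait for the other. */
int same_group(size_t a, size_t b, size_t local_size)
{
    return group_of(a, local_size) == group_of(b, local_size);
}
```

For the question's launch, num_work_groups(5, 1) gives 5 groups and same_group(0, 4, 1) is false, so work-item 0 never waits for work-item 4. With a local size of 5 there is a single group and same_group(0, 4, 5) is true.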