Kernel code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_amd_printf : enable
__kernel void calculate(__global double* in)
{
    int idx = get_global_id(0);                                     // statement 1
    printf("started for %d workitem\n", idx);                       // statement 2
    in[idx] = idx + 100;                                            // statement 3
    printf("value changed to %lf in %d workitem\n", in[idx], idx);  // statement 4
    barrier(CLK_GLOBAL_MEM_FENCE);                                  // statement 5
    printf("completed for %d workitem\n", idx);                     // statement 6
}
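I invoke the kernel with clEnqueueNDRangeKernel, passing as its argument an array of 5 elements of type double, each initialized to 0.0.
I call the kernel with a global_work_size of 5, so each element of the array is handled by its own work-item.
My theoretical understanding of barriers is this: "To synchronize work-items in a work-group, OpenCL provides the barrier function. This forces a work-item to wait until every other work-item in the group has reached the barrier. By creating a barrier, you can make sure that every work-item has reached the same point in its processing. This is crucial when the work-items need to finish computing intermediate results that will be used in future computation."
Hence, I was expecting output like this: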
started for 0 workitem
started for 1 workitem
value changed to 100.000000 in 0 workitem
value changed to 101.000000 in 1 workitem
started for 3 workitem
value changed to 103.000000 in 3 workitem
started for 2 workitem
value changed to 102.000000 in 2 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem
completed for 3 workitem
completed for 0 workitem
completed for 1 workitem
completed for 2 workitem
completed for 4 workitem
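That is, the "completed" statements would all come together at the end, because the barrier holds each work-item back until all the others have reached that point.
However, the result I got was: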
started for 0 workitem
value changed to 100.000000 in 0 workitem
completed for 0 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem
completed for 4 workitem
started for 1 workitem
started for 2 workitem
started for 3 workitem
value changed to 101.000000 in 1 workitem
value changed to 103.000000 in 3 workitem
completed for 3 workitem
value changed to 102.000000 in 2 workitem
completed for 2 workitem
completed for 1 workitem
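Am I missing something in the logic? If so, how does a barrier actually work in an OpenCL kernel?
I added some more checks to the kernel, so that the updated values are cross-checked after the barrier instead of relying on print statements.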
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_amd_printf : enable
__kernel void calculate(__global double* in)
{
    int idx = get_global_id(0);
    in[idx] = idx + 100;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (idx == 0) {
        in[0] = in[4];
        in[1] = in[3];
        in[2] = in[2];
        in[3] = in[1];
        in[4] = in[0];
    }
}
After this, I expected the array to be:
after arr[0] = 104.000000
after arr[1] = 103.000000
after arr[2] = 102.000000
after arr[3] = 101.000000
after arr[4] = 100.000000
But as a result, I got:
after arr[0] = 0.000000
after arr[1] = 101.000000
after arr[2] = 102.000000
after arr[3] = 103.000000
after arr[4] = 104.000000
Answer 0: (score: 2)
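Yes, you are missing the fact that adding printf() invalidates any conclusion you draw from the order of the output.
In fact, the OpenCL specification says that the behavior of printf() is implementation-defined: "In the case that printf is executed from multiple work-items concurrently, there is no guarantee of ordering with respect to written data."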
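Simple logic suggests that the queue is flushed in order, one WI at a time, since that is the simplest way to serialize the flush after a parallel execution has filled many buffers (one per WI that called printf).
The work-items do execute in the order you expect, but the output is flushed to stdout only after the kernel has finished, and the flush does not follow the original order.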
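Answer 1: (score: 2)
The code looks perfectly fine. I suspect the local work-group size: if you do not specify a local work-group size, the OpenCL implementation picks what it considers the best one based on some checks (usually ONE).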
Check your clEnqueueNDRangeKernel call against the call below:
size_t global_item_size = 5; // Specifies the total number of work-items
size_t local_item_size = 5;  // Specifies the number of work-items per local work-group
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
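Note: This answer assumes that you have not specified the local work-group size, or have not set it correctly for your requirement.
A bit more about work-groups:
A barrier blocks the threads of one work-group only. Since you did not specify the work-group size (so it is treated as one), you end up with 5 work-groups, each containing a single thread, and the barrier has nothing to synchronize.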