Question

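I'm very new to OpenCL programming.

I have 2 million polygons (each with 4 lines in this example, but the count varies) and I need to test them for intersection against 1000 ellipses.

I need to know, for each ellipse, whether it intersects at least one polygon.

To do this I created a polygons buffer containing all the points, and an ellipses buffer.

My output buffer holds 1000 ints, all initialized to 0. The kernel sets the entry at the ellipse's index to 1 when it finds an intersection.

I run the kernel with a two-dimensional global size, {2 million, 1000}.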

__kernel void polygonsIntersectsEllipses(   __global const Point* pts, 
                                            __global const Ellipse* ellipses, 
                                            __global int* output)
{
    int polygonIdx = get_global_id(0);
    int ellipseIdx = get_global_id(1);

    if (<<isIntersects>>) {
        output[ellipseIdx] = 1;
    }
}

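The problem is that once one of the polygons intersects an ellipse, I don't need to test the remaining polygons against that ellipse.

I tried checking output[ellipseIdx] != 0 before the intersection test, but performance didn't change much.

I also tried a one-dimensional global size of 1000 (one work item per ellipse), looping over the 2 million polygons inside the kernel and stopping at the first hit, but that didn't change much either.

Am I doing something wrong? Can I speed this up? Any tips?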

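Edit

Using the hint from @Moises and a lot of research, I changed my code to run 2 million work items in a single dimension, organized into work groups. I changed all my structs to native types and got rid of the modulus operation. Basically, wherever I could copy data from global to private/local memory, I did.

My local size is my device's CL_DEVICE_MAX_WORK_GROUP_SIZE, which is 1024 on both my CPU and GPU, so a single work group covers all 1000 of my ellipses.

Host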

size_t global = 1999872; // 2 million rounded down to a multiple of 1024 (1953 * 1024), for the test
size_t local = 1024;
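A side note on the sizes above: 1999872 is 2 million rounded down to a multiple of the local size, which leaves the last 128 polygons untested. A common alternative is to round the global size up and have the kernel skip the polygon work for indices past the real count. The sketch below is plain C for the host side; the helper names `round_up_global` and `round_down_global` are hypothetical, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of `local` that is >= `count`. */
static size_t round_up_global(size_t count, size_t local)
{
    assert(local > 0);
    return ((count + local - 1) / local) * local;
}

/* Hypothetical helper: largest multiple of `local` that is <= `count`
 * (this is the value used for the test in the post). */
static size_t round_down_global(size_t count, size_t local)
{
    assert(local > 0);
    return (count / local) * local;
}
```

With the rounded-up size the kernel would also need the real polygon count as an argument, and the bounds check should wrap only the polygon work so that every work item still reaches the barrier.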

My code now looks like this:

 __kernel void polygonsIntersectsEllipses(  __global const float4* pts, 
                                            __global const float4* ellipses, 
                                            int ellipseCount,
                                            __local float4* localEllipses,
                                            __global int* output)
{
    // Saving the ellipses to local memory
    int localId = get_local_id(0);
    if (localId < ellipseCount)
        localEllipses[localId] = ellipses[localId];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Saving the current polygon into private memory
    int polygonIdx = get_global_id(0);
    float2 private_pts[5];
    for (int currPtsIdx = 0; currPtsIdx < 4; currPtsIdx++)
    {
        // Assumption: x and y live in the first two components of each float4
        private_pts[currPtsIdx] = pts[polygonIdx * 4 + currPtsIdx].xy;
    }

    // Repeat the first point at the end so the intersection algorithm needs no modulus to wrap around
    private_pts[4] = private_pts[0];

    // Run over all the ellipse in the local memory including checking if already there is an output
    for (int ellipseIdx = 0; ellipseIdx < ellipseCount && output[ellipseIdx] == 0; ++ellipseIdx) {
       if (<<isIntersects Using localEllipses array and private_pts>>) {
           output[ellipseIdx] = 1;
       }
    }
}

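Results

CPU: not much of an improvement, about 1.1x faster after the change.

GPU: 6.5x faster (I'm thrilled).

Is there anything else I can improve? As a reminder, once one polygon intersects an ellipse, we don't need to check the remaining polygons against it. How do I do that? The trick of checking the output value doesn't really work: performance is the same with or without it.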

Answer 1

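If I understand correctly, you launch 2 million x 1000 threads, each reading its own polygon data and ellipse, right? So for each polygon, 1000 threads read the same memory locations (the polygon data), don't they? To avoid this memory-bound behavior, you could create only 2 million threads and loop over the ellipses with a for loop of 1000 iterations. Or, as an intermediate solution, use a grid of 2 million x 64 threads, where each thread tests 16 ellipses against its polygon. I don't know whether these are better than your solution, but they avoid the redundant memory accesses.

Regards, Moises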
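To make the intermediate layout in the answer above concrete, here is a rough sketch, in plain C, of how a work item in the second grid dimension could derive the ellipses it owns. The `EllipseRange` type and the `ellipse_range` and `chunk_count` functions are hypothetical names for illustration; in a kernel, `chunk` would come from `get_global_id(1)`.

```c
#include <assert.h>

#define ELLIPSES_PER_ITEM 16  /* from the answer: 16 ellipses per thread */

/* Hypothetical type: half-open range [begin, end) of ellipse indices. */
typedef struct { int begin; int end; } EllipseRange;

/* Range of ellipses handled by one chunk, clamped to ellipseCount. */
static EllipseRange ellipse_range(int chunk, int ellipseCount)
{
    assert(chunk >= 0 && ellipseCount >= 0);
    EllipseRange r;
    r.begin = chunk * ELLIPSES_PER_ITEM;
    r.end = r.begin + ELLIPSES_PER_ITEM;
    if (r.begin > ellipseCount) r.begin = ellipseCount;
    if (r.end > ellipseCount) r.end = ellipseCount;
    return r;
}

/* Number of chunks needed to cover every ellipse. */
static int chunk_count(int ellipseCount)
{
    assert(ellipseCount >= 0);
    return (ellipseCount + ELLIPSES_PER_ITEM - 1) / ELLIPSES_PER_ITEM;
}
```

For 1000 ellipses, 63 chunks are enough; with the 64 suggested in the answer, the last chunk simply gets an empty range.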

Answer 2

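Optimizations:

Use the smallest amount of memory possible. A __global int* output is too much for holding boolean values; use char instead. Or even better, use a bit array. (Bitwise operations are fast compared to global reads.)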
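As a rough illustration of the bit-array idea, here is a minimal sketch in plain C that packs one flag per bit into 32-bit words. The helper names `flag_words`, `flag_set` and `flag_test` are hypothetical. With this layout, 1000 flags fit in 32 words (128 bytes) instead of the 4000 bytes taken by 1000 ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit words needed to hold `count` one-bit flags. */
static size_t flag_words(size_t count)
{
    return (count + 31) / 32;
}

/* Set flag i. Not safe against concurrent writers as written. */
static void flag_set(uint32_t *words, size_t i)
{
    assert(words != NULL);
    words[i / 32] |= (uint32_t)1 << (i % 32);
}

/* Return 1 if flag i is set, 0 otherwise. */
static int flag_test(const uint32_t *words, size_t i)
{
    assert(words != NULL);
    return (int)((words[i / 32] >> (i % 32)) & 1u);
}
```

One caveat: inside a kernel, several work items may set different bits of the same word at the same time, so the plain `|=` would have to become an atomic operation such as OpenCL's `atomic_or`, which has a cost of its own.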

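You should not read output[ellipseIdx] == 0 from global memory again in every thread. That is very slow; instead, copy the flags into local memory at the start, together with the ellipse data. Note: only work groups launched after a match has been found will benefit from the speedup. Still, this saves a lot of global reads, which is worth much more than saving some computation. Also, nothing can be gained within a work group, because by the time one work item finds a match, all the other work items in the group have already processed that ellipse.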

__kernel void polygonsIntersectsEllipses(__global const float4* pts,
                                         __global const float4* ellipses,
                                         int ellipseCount,
                                         __local float4* localEllipses,
                                         __local char* localOuts,
                                         __global char* output)
{
    // Cache the current output flags in local memory
    int localId = get_local_id(0);
    if (localId < ellipseCount)
        localOuts[localId] = output[localId];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Copy only the ellipses that still need checking (flag is still 0)
    if (localId < ellipseCount && localOuts[localId] == 0)
        localEllipses[localId] = ellipses[localId];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Save the current polygon into private memory
    int polygonIdx = get_global_id(0);
    float2 private_pts[5];
    for (int currPtsIdx = 0; currPtsIdx < 4; currPtsIdx++)
    {
        // Assumption: x and y live in the first two components of each float4
        private_pts[currPtsIdx] = pts[polygonIdx * 4 + currPtsIdx].xy;
    }

    // Repeat the first point at the end so the intersection algorithm needs no modulus to wrap around
    private_pts[4] = private_pts[0];

    // Run over all the ellipses in local memory, skipping those that already have a hit
    for (int ellipseIdx = 0; ellipseIdx < ellipseCount; ++ellipseIdx) {
        if (localOuts[ellipseIdx] == 0) {
            if (<<isIntersects Using localEllipses array and private_pts>>) {
                localOuts[ellipseIdx] = 1;
            }
        }
        // Kept outside the branch: every work item in the group must reach the barrier
        barrier(CLK_LOCAL_MEM_FENCE);
        if (localOuts[ellipseIdx] && localId == 0) {
            output[ellipseIdx] = 1;
        }
    }
}
