OpenCL memory optimization - nearest neighbor

Time: 2012-10-22 03:54:05

Tags: optimization opencl shared-memory gpgpu nearest-neighbor

I am writing a program in OpenCL that receives two arrays of points and computes the nearest neighbor for each point.

I have two kernels for this. One computes the distance in 4 dimensions and the other in 6 dimensions. They are below:

4 dimensions:

kernel void BruteForce(
    global  read_only float4* m,
    global  float4* y,
    global write_only ushort* i,
    read_only uint mx)
{
    int index = get_global_id(0);
    float4 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = -1;
    int x = 0;
    int mmx = mx;
    for(x = 0; x < mmx; x++)
    {
        float dist = fast_distance(curY, m[x]);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = x;
        }
    }
    i[index] = minIdx;
    y[index] = minDist;
}

6 dimensions:

kernel void BruteForce(
    global  read_only float8* m,
    global  float8* y,
    global write_only ushort* i,
    read_only uint mx)
{
    int index = get_global_id(0);
    float8 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = -1;
    int x = 0;
    int mmx = mx;
    for(x = 0; x < mmx; x++)
    {
        float8 mx = m[x];
        float d0 = mx.s0 - curY.s0;
        float d1 = mx.s1 - curY.s1;
        float d2 = mx.s2 - curY.s2;
        float d3 = mx.s3 - curY.s3;
        float d4 = mx.s4 - curY.s4;
        float d5 = mx.s5 - curY.s5;

        float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = index;
        }
    }
    i[index] = minIdx;
    y[index] = minDist;
}

I am looking for ways to optimize this program for the GPGPU. I have read a few articles on GPGPU optimization using local memory (including http://www.macresearch.org/opencl_episode6, which comes with source code). I have tried applying that technique and came up with this code:

kernel void BruteForce(
    global  read_only float4* m,
    global  float4* y,
    global write_only ushort* i,
    __local float4 * shared)
{
    int index = get_global_id(0);
    int lsize = get_local_size(0);
    int lid = get_local_id(0);

    float4 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = 64000;
    int x = 0;
    for(x = 0; x < {0}; x += lsize)
    {
        if((x+lsize) > {0}) 
            lsize = {0} - x;
        if ( (x + lid) < {0})
        {
            shared[lid] = m[x + lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int x1 = 0; x1 < lsize; x1++)
        {
            float dist = distance(curY, shared[x1]);

            if (dist < minDist)
            {
                minDist = dist;
                minIdx = x + x1;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    i[index] = minIdx;
    y[index] = minDist;
}
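
As a point of reference (this is not the original host code): the {0} placeholders above are presumably filled in on the host with the element count of m before the kernel source is compiled, and the __local buffer is sized when its kernel argument is set with a NULL value. A minimal host-side sketch, with illustrative names, of how such a kernel could be launched:

#include <CL/cl.h>

// Host-side sketch (illustrative; not the original host code). Assumes the
// {0} placeholders in the kernel source were already substituted before the
// program was built, and that the kernel, queue and the three buffers were
// created elsewhere.
static void launch_brute_force(cl_command_queue queue, cl_kernel kernel,
                               cl_mem mBuffer, cl_mem yBuffer, cl_mem iBuffer,
                               size_t numQueryPoints)
{
    size_t localWorkSize  = 64;                          // work-group size
    size_t globalWorkSize = numQueryPoints;              // one work item per entry of y (multiple of 64)
    size_t localBytes     = localWorkSize * sizeof(cl_float4);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &mBuffer); // global float4* m
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &yBuffer); // global float4* y
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &iBuffer); // global ushort* i
    clSetKernelArg(kernel, 3, localBytes, NULL);         // __local float4* shared, sized here

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalWorkSize, &localWorkSize, 0, NULL, NULL);
}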

I am getting garbage results for my 'i' output (for example, many of the values are the same). Can anyone point me in the right direction? I would appreciate any answer that helps me improve this code, or that can spot a problem with the optimized version above.

Thank you very much, Cauê

1 Answer:

Answer 0 (score: 1)

One way to get a big speedup here is to use local data structures and compute whole blocks of data at a time. You should also only need a single read/write global vector (of float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation up to you, since you know the specifics of your application.

Some pseudo-ish code (4d):

computeBlockSize is the size of the blocks to read from global and crunch.
This value should be a multiple of your work group size. I like 64 as a WG
size; it tends to perform well on most platforms. We will be
allocating 2 * float4 * computeBlockSize + uint * computeBlockSize of shared memory
(max value for ocl 1.0 ~448, ocl 1.1 ~896).
#define computeBlockSize 256

__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];

now blockA gets computed against all blockB combinations. this is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global

steps:
load blockA into local memory with async_work_group_copy
blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed


need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0))
  for(k=0;k<computeBlockSize; k++)
    if(j==k) continue; //no self-comparison
    calculate distance of blockA[j] vs blockA[k]
    store min distance in blockA[j].w
    store global index (= get_group_id(0)*computeBlockSize + k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)

for (i=0;i<get_num_groups(0);i++)
  if (i==get_group_id(0)) continue;
  load blockB into local memory: async_work_group_copy(...)
  for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0))
    for(k=0;k<computeBlockSize; k++)
      calculate distance of blockA[j] vs blockB[k]
      store min distance in blockA[j].w
      store global index (= i*computeBlockSize +k) of nearest in blockAnearestIndex[j]
  barrier(local_mem_fence)

write blockA and blockAnearestIndex to global memory using two async_work_group_copy
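
To make the scheme above concrete, here is a rough, untested OpenCL sketch under a few assumptions that are not in the original code: the kernel and variable names (BruteForceBlocked, pts, nearest) are illustrative, the xyz components of each float4 hold the point while w carries the running minimum distance, the data size is a multiple of computeBlockSize, and the host launches one work group (e.g. 64 work items) per computeBlockSize points.

// Rough, untested OpenCL sketch of the blocked scheme above.
// pts is the single read/write float4 vector: xyz hold the point, w is
// overwritten with the current minimum distance. nearest receives the global
// index of the nearest point. One work group handles computeBlockSize points.
#define computeBlockSize 256

kernel void BruteForceBlocked(
    global float4* pts,
    global uint* nearest)
{
    local float4 blockA[computeBlockSize];
    local float4 blockB[computeBlockSize];
    local uint blockAnearestIndex[computeBlockSize];

    const uint lid = get_local_id(0);
    const uint lsize = get_local_size(0);
    const uint base = get_group_id(0) * computeBlockSize;

    // load this work group's block into local memory
    event_t ev = async_work_group_copy(blockA, pts + base, computeBlockSize, 0);
    wait_group_events(1, &ev);

    // each work item owns the rows j = lid, lid + lsize, lid + 2*lsize, ...
    for (uint j = lid; j < computeBlockSize; j += lsize)
    {
        blockA[j].w = MAXFLOAT;
        blockAnearestIndex[j] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // 1) blockA against itself; only blockA[j].w is ever written
    for (uint j = lid; j < computeBlockSize; j += lsize)
        for (uint k = 0; k < computeBlockSize; k++)
        {
            if (j == k) continue; // no self-comparison
            float dx = blockA[j].x - blockA[k].x;
            float dy = blockA[j].y - blockA[k].y;
            float dz = blockA[j].z - blockA[k].z;
            float dist = sqrt(dx * dx + dy * dy + dz * dz);
            if (dist < blockA[j].w)
            {
                blockA[j].w = dist;
                blockAnearestIndex[j] = base + k;
            }
        }
    barrier(CLK_LOCAL_MEM_FENCE);

    // 2) blockA against every other block
    for (uint i = 0; i < get_num_groups(0); i++)
    {
        if (i == get_group_id(0)) continue;
        ev = async_work_group_copy(blockB, pts + i * computeBlockSize,
                                   computeBlockSize, 0);
        wait_group_events(1, &ev);

        for (uint j = lid; j < computeBlockSize; j += lsize)
            for (uint k = 0; k < computeBlockSize; k++)
            {
                float dx = blockA[j].x - blockB[k].x;
                float dy = blockA[j].y - blockB[k].y;
                float dz = blockA[j].z - blockB[k].z;
                float dist = sqrt(dx * dx + dy * dy + dz * dz);
                if (dist < blockA[j].w)
                {
                    blockA[j].w = dist;
                    blockAnearestIndex[j] = i * computeBlockSize + k;
                }
            }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write the distances (in w) and the nearest indices back to global memory
    event_t evs[2];
    evs[0] = async_work_group_copy(pts + base, blockA, computeBlockSize, 0);
    evs[1] = async_work_group_copy(nearest + base, blockAnearestIndex,
                                   computeBlockSize, 0);
    wait_group_events(2, evs);
}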

There should be no problem with reading blockB while another work group is writing to the same block (as its own blockA), because only the W values can have changed. If you do happen to run into a problem with this - or if you really do need two different vectors of points - you could use two global vectors just like you have above, one with the A's (writeable) and one with the B's (read only).
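
If that ever becomes an issue, the two-vector variant would only change the kernel's interface; this is an illustrative sketch only (the name BruteForceBlockedAB is made up), and the body would be the same blocked loop as above with blockB copied from ptsB:

// Illustrative two-vector variant: only ptsA is ever written (its w gets the
// minimum distance); ptsB holds the read-only reference points.
kernel void BruteForceBlockedAB(
    global float4* ptsA,
    global const float4* ptsB,
    global uint* nearest)
{
    // same blocked loops as in the sketch above, except that blockB is
    // filled with async_work_group_copy from ptsB instead of from ptsA
}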

This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks, in a way similar to the one above. The new kernel could execute after the first one, and you would save yourself the second pci-e transfer. Alternatively, you could use a distance of -1 to signify a skip in the comparison of two elements (i.e. if blockA[j].w == -1 or blockB[k].w == -1, continue). The second solution would result in a lot more branching in your kernel, which is why I recommended writing a new kernel. Only a very small fraction of your data points will actually fall into an edge block.
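
For the second option, here is a small host-side sketch in C, with illustrative names (pad_points is not from the original code), of padding the point array up to a multiple of computeBlockSize and marking the padding with w = -1 so the kernel's distance loops can skip it:

#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

#define computeBlockSize 256

// Pad the host-side point array to a multiple of computeBlockSize and mark
// the extra entries with w = -1 so the kernel can skip them.
static cl_float4* pad_points(const cl_float4* points, size_t numPoints,
                             size_t* paddedCount)
{
    *paddedCount = ((numPoints + computeBlockSize - 1) / computeBlockSize)
                   * computeBlockSize;
    cl_float4* padded = (cl_float4*)calloc(*paddedCount, sizeof(cl_float4));
    if (!padded) return NULL;
    memcpy(padded, points, numPoints * sizeof(cl_float4));
    for (size_t p = numPoints; p < *paddedCount; p++)
        padded[p].s[3] = -1.0f; // sentinel: skipped in the distance loops
    return padded;
}

Inside the kernel, the distance loops would then skip any element whose w is -1 (and the initialisation of w to MAXFLOAT would leave those padded entries alone).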