Question

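I'm new to OpenCL and am currently working on optimizing template matching with OpenCL. I ran some experiments with smaller templates and found that my OpenCL implementation was faster than OpenCV's CPU version. But in this particular case the template is very large (2048x2048) and the source image is 3072x3072, and the OpenCV CPU implementation (137 seconds) is far ahead of the OpenCL one (2000 seconds). Please suggest some ways to optimize my code, which is shown below.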

__kernel void corrln(global const unsigned char* ref_image,
                     global const unsigned char* template,
                     global float* corrln)
{
    const uint Width = get_global_size(0);
    const int2 pos = {get_global_id(0), get_global_id(1)};

    float sum = 0;

    for (int y = pos.y; y < 2048; y++)
    {
        for (int x = pos.x; x < 2048; x++)
        {
            const int2 xy = { x, y };
            const int2 txy = { x - pos.x, y - pos.y };
            sum += ref_image[index(xy, Width)] * template[index(txy, 2048)];
        }
    }

    corrln[index(pos, Width)] = sum;
}

Answer 1

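Assuming your ref_image is reasonably smaller than 2048 (say, 1024x1024) and the ND-range size equals the ref_image size, each WI (work item) performs a different number of computations.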

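The WI with pos.x == 0 and pos.y == 0 performs 2048 * 2048 = 4M computations inside the two loops, while the WI with pos.x == 1023 and pos.y == 1023 performs roughly 1024 * 1024 = 1M. That is far too much work for a single WI.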
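To make the imbalance concrete, here is a minimal C sketch that counts the loop iterations a single work item executes in the question's kernel. The function name `iterations_for_work_item` and its `limit` parameter are hypothetical, introduced only for this illustration; the count simply mirrors the loop bounds `y = pos.y .. limit-1` and `x = pos.x .. limit-1`.

```c
#include <assert.h>

/* Number of inner-loop iterations executed by the work item at (pos_x, pos_y)
   for loops of the form
       for (y = pos_y; y < limit; y++)
           for (x = pos_x; x < limit; x++)
   as in the question's kernel, where limit is 2048. */
long long iterations_for_work_item(int pos_x, int pos_y, int limit)
{
    assert(pos_x >= 0 && pos_y >= 0 && limit >= 0);
    long long nx = pos_x < limit ? (long long)(limit - pos_x) : 0;
    long long ny = pos_y < limit ? (long long)(limit - pos_y) : 0;
    return nx * ny;
}
```

For pos = (0, 0) this gives 2048 * 2048 = 4,194,304 iterations, and for pos = (1023, 1023) it gives 1025 * 1025 = 1,050,625, so work items in the same launch differ by about a factor of four and each one runs for millions of iterations.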

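Try to tile the task so that every WI performs a reasonable, fixed amount of computation. For example, starting at the first column of ref_image, do multiple kernel launches, where each launch processes the 16 columns to the right and accumulates into the corrln array; then move on to the second column, and so on.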

The kernel might look something like this (for illustration only!!!):

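// Illustration only. tx and ty are hypothetical arguments: the host sets them
// per launch to the template position where the current 16-pixel strip starts.
// index() is the helper from the question, assumed to return y * width + x.
__kernel void corrln(
    global const unsigned char* ref_image,
    global const unsigned char* template,
    global float* corrln,
    const int tx,
    const int ty)
{
    const uint Width = get_global_size(0);
    const int2 pos = {get_global_id(0), get_global_id(1)};
    const int2 rxy = {pos.x + tx, pos.y + ty};   // strip start in ref_image
    const int2 txy = {tx, ty};                   // strip start in template
    // vload16 counts its offset in units of 16 elements, so the pointer is
    // advanced instead. The host must keep both reads inside the images.
    float16 ref = convert_float16(vload16(0, ref_image + index(rxy, Width)));
    float16 tpl = convert_float16(vload16(0, template + index(txy, 2048)));
    // dot() accepts at most 4 components, so fold the 16 products first.
    float16 p = ref * tpl;
    float sum = dot(p.s0123 + p.s4567 + p.s89ab + p.scdef, (float4)(1.0f));
    corrln[index(pos, Width)] += sum;
}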
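The idea of many small launches that each add a fixed-size strip can be checked on the CPU. The following is a minimal plain-C sketch, not the answerer's code: it emulates one launch as a loop over all output pixels and compares the accumulated result with the direct double loop. All names (`correlate_direct`, `launch_strip`, `correlate_strips`) and the tiny image sizes are assumptions chosen for illustration, and the reference image is assumed to be padded so that every read stays in bounds.

```c
#include <assert.h>

/* Tiny, hypothetical sizes so the example runs instantly. */
enum { TPL_W = 32, TPL_H = 4, OUT_W = 8, OUT_H = 8, STRIP = 16,
       REF_W = OUT_W + TPL_W - 1, REF_H = OUT_H + TPL_H - 1 };

/* Direct form: each output pixel walks the whole template by itself. */
void correlate_direct(const unsigned char *ref, const unsigned char *tpl,
                      float *out)
{
    for (int py = 0; py < OUT_H; py++)
        for (int px = 0; px < OUT_W; px++) {
            float sum = 0.0f;
            for (int ty = 0; ty < TPL_H; ty++)
                for (int tx = 0; tx < TPL_W; tx++)
                    sum += (float)ref[(py + ty) * REF_W + px + tx]
                         * (float)tpl[ty * TPL_W + tx];
            out[py * OUT_W + px] = sum;
        }
}

/* One emulated kernel launch: every output pixel ("work item") adds the
   contribution of the STRIP template pixels starting at (tx0, ty). */
void launch_strip(const unsigned char *ref, const unsigned char *tpl,
                  float *out, int tx0, int ty)
{
    assert(tx0 >= 0 && tx0 + STRIP <= TPL_W && ty >= 0 && ty < TPL_H);
    for (int py = 0; py < OUT_H; py++)
        for (int px = 0; px < OUT_W; px++) {
            float sum = 0.0f;
            for (int k = 0; k < STRIP; k++)
                sum += (float)ref[(py + ty) * REF_W + px + tx0 + k]
                     * (float)tpl[ty * TPL_W + tx0 + k];
            out[py * OUT_W + px] += sum;
        }
}

/* Host side: clear the accumulator, then issue one launch per strip. */
void correlate_strips(const unsigned char *ref, const unsigned char *tpl,
                      float *out)
{
    for (int i = 0; i < OUT_W * OUT_H; i++)
        out[i] = 0.0f;
    for (int ty = 0; ty < TPL_H; ty++)
        for (int tx0 = 0; tx0 < TPL_W; tx0 += STRIP)
            launch_strip(ref, tpl, out, tx0, ty);
}
```

With these sizes every partial sum is an integer below 2^24, so the float results of both variants match exactly; with a 2048x2048 template the sums exceed that range and small rounding differences between the two orderings should be expected.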

OpenCL template matching with a larger template is slower than the OpenCV CPU version
