Question

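Any suggestions for optimizing the code below? The code first converts the image to grayscale, inverts it, and then thresholds it (that code isn't included because it's trivial). It then sums the elements of each row and each column (all elements are either 1 or 0). Finally, it finds the row index and column index of the row and column with the highest sums.

The code is supposed to find the centroid of the image, and it works, but I'd like to make it faster.

I'm developing for API 23, so I can't use reduction kernels.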

Java snippet:

private int[] sumValueY = new int[640];
private int[] sumValueX = new int[480];

rows_indices_alloc = Allocation.createSized( rs, Element.I32(rs), height, Allocation.USAGE_SCRIPT);
col_indices_alloc = Allocation.createSized( rs, Element.I32(rs), width, Allocation.USAGE_SCRIPT);

public RenderscriptProcessor(RenderScript rs, int width, int height)
{
   mScript.set_gIn(mIntermAllocation);

   mScript.forEach_detectX(rows_indices_alloc);
   mScript.forEach_detectY(col_indices_alloc);

   rows_indices_alloc.copyTo(sumValueX);
   col_indices_alloc.copyTo(sumValueY);
 }

RenderScript .rs snippet:

#pragma version(1)
#pragma rs java_package_name(org.gearvrf.renderscript)
#include "rs_debug.rsh"
#pragma rs_fp_relaxed

const int mImageWidth=640;
const int mImageHeight=480;

int32_t maxsX=-1;
int32_t maxIndexX;

int32_t maxsY=-1;
int32_t maxIndexY;

rs_allocation gIn;

void detectX(int32_t v_in, int32_t x, int32_t y) {

    int32_t sum=0;

    for ( int i = 0; i < (mImageWidth); i++) {

       float4 f4 = rsUnpackColor8888(rsGetElementAt_uchar4(gIn, i, x));
       sum+=(int)f4.r;
    }

    if((sum>maxsX)){

        maxsX=sum;
        maxIndexX = x;
    }
}

void detectY(int32_t v_in, int32_t x, int32_t y) {

     int32_t sum=0;

     for ( int i = 0; i < (mImageHeight); i++) {

        float4 f4 = rsUnpackColor8888(rsGetElementAt_uchar4(gIn, x, i));
        sum+=(int)f4.r;
     }

     if((sum>maxsY)){
         maxsY=sum;
         maxIndexY = x;
     }

}

Any help would be appreciated.

Answer 1

float4 f4 = rsUnpackColor8888(rsGetElementAt_uchar4(gIn, x, i));
sum+=(int)f4.r;

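This converts from int to float and then back to int. I think you can simplify it by doing the following (the sum is then in raw byte units, 0 or 255 per pixel for a thresholded image, rather than 0 or 1, which does not change which row or column is the maximum):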

sum += rsGetElementAt_uchar4(gIn, x, i).r;

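I don't know exactly how your earlier stages work, since you haven't posted them, but you should try to generate packed values to read here. So either pack grayscale values into all four .rgba channels, or use a single-channel format and then use rsAllocationVLoadX_uchar4 to fetch 4 values at a time.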
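To illustrate why packing helps, here is a minimal plain-Java model (not RenderScript, and not the asker's code) of four thresholded pixels stored in one 32-bit word, so that a single load yields four useful values. The class and method names are hypothetical and exist only for this sketch.

```java
public class PackedPixels {
    /** Packs four 8-bit values (0..255) into one int, first value in the lowest byte. */
    static int pack4(int a, int b, int c, int d) {
        return (a & 0xFF) | ((b & 0xFF) << 8) | ((c & 0xFF) << 16) | ((d & 0xFF) << 24);
    }

    /** Sums the four 8-bit values stored in one packed word. */
    static int sumPacked(int word) {
        return (word & 0xFF)
                + ((word >>> 8) & 0xFF)
                + ((word >>> 16) & 0xFF)
                + ((word >>> 24) & 0xFF);
    }

    /** Sums a whole row in which every int holds four pixels. */
    static int sumRow(int[] packedRow) {
        int sum = 0;
        for (int word : packedRow) {
            sum += sumPacked(word); // one load, four pixels accumulated
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] row = { pack4(1, 0, 1, 1), pack4(0, 0, 1, 0) };
        System.out.println(sumRow(row));
    }
}
```

With an unpacked RGBA layout, three of the four bytes in each load are wasted when only one channel carries the thresholded value; the packed layout reads a quarter as much memory for the same sum.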

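Also, try combining the previous stage with this one. If you don't need the intermediate results of those calculations, it may be cheaper to load from memory once and then do those conversions in registers.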

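You can also play with the number of values each thread operates on. You could try having each kernel invocation process width/2, width/4, or width/8 elements and see how they perform. This gives the GPU more threads, especially on lower-resolution images, but requires more reduction steps.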
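The trade-off between work per thread and extra reduction steps can be modeled on the CPU. The following is a rough plain-Java sketch, not RenderScript: each call to the hypothetical partialSum stands in for one kernel invocation, and the final loop stands in for the additional reduction pass. All names here are assumptions made for illustration.

```java
public class ChunkedRowSum {
    /** One simulated work item: sums the pixels in [start, end) of a row. */
    static int partialSum(int[] row, int start, int end) {
        int sum = 0;
        for (int i = start; i < end; i++) {
            sum += row[i];
        }
        return sum;
    }

    /** Splits a row into `chunks` pieces (chunks >= 1), sums each piece, then reduces. */
    static int rowSum(int[] row, int chunks) {
        int chunkLen = (row.length + chunks - 1) / chunks; // ceiling division
        int[] partial = new int[chunks];
        for (int c = 0; c < chunks; c++) {
            int start = Math.min(c * chunkLen, row.length);
            int end = Math.min(start + chunkLen, row.length);
            partial[c] = partialSum(row, start, end); // would run in parallel on the GPU
        }
        // Extra reduction step: combine the per-chunk results.
        int total = 0;
        for (int p : partial) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] row = {1, 0, 1, 1, 0, 1, 1};
        for (int chunks : new int[] {1, 2, 4, 8}) {
            System.out.println(chunks + " chunks -> " + rowSum(row, chunks));
        }
    }
}
```

The result is the same for every chunk count; only the split between parallel work and the final combine changes, which is why the best setting has to be measured on the target device.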

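You also have a multi-writer race condition on the maxsX/maxsY and maxIndexX/maxIndexY variables. If you care about getting the correct answer, all of those writes need to use atomics. I think you may have posted the wrong code, because you never store to the *_indices_alloc allocations, yet you copy from them at the end. So really you should store all the sums into those allocations, and then use either a single-threaded function or a kernel with atomics to get the absolute maximum and its index.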
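One way to do the single-threaded step is on the Java side, after copyTo has filled sumValueX and sumValueY with the per-row and per-column sums. This is only a sketch under that assumption (the kernels in the question would first have to write their sums to the output allocations); the class and method names are hypothetical.

```java
public class MaxIndexFinder {
    /**
     * Returns the index of the largest element, or -1 for an empty array.
     * Ties resolve to the lowest index.
     */
    static int argMax(int[] sums) {
        if (sums.length == 0) {
            return -1;
        }
        int bestIndex = 0;
        for (int i = 1; i < sums.length; i++) {
            if (sums[i] > sums[bestIndex]) {
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // Stand-ins for the arrays copied back from the allocations.
        int[] rowSums = {0, 3, 7, 7, 2};
        int[] colSums = {1, 0, 5, 4};
        System.out.println("row " + argMax(rowSums) + ", column " + argMax(colSums));
    }
}
```

Because a single thread scans each array, there is no race on the running maximum, and for 480 rows and 640 columns the scan is cheap compared with the per-pixel summation.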
