NEON increases run time

Time: 2017-03-12 17:19:42

Tags: c++ neon

I am currently trying to optimize some of my image processing code to use NEON instructions.

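Let's say I have two very large float arrays, and I want to multiply each value of the first by three consecutive values of the second. (The second one is three times as large.)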

float*     l_ptrGauss_pf32   = [...];
float*     l_ptrLaplace_pf32 = [...]; // Three times as large 

for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
    float l_weight_f32 = *l_ptrGauss_pf32;

    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;

    ++l_ptrGauss_pf32;
}

When I replace the code above with NEON intrinsics, the run time is about 10% longer.

float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;

for( uint64_t k=0; k<(l_lastPixelInBlock_ui64/4); ++k)
{
    l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
    l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
    l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32+4);
    l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32+8);

    l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
    l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
    l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);

    vst1q_f32(l_ptrLaplace_pf32,   l_laplElem1_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+4, l_laplElem2_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+8, l_laplElem3_f32x4);

    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32   += 4;
}

Both versions are compiled with Apple LLVM 8.0 using -Ofast. Is the compiler really that good at optimizing this code even without NEON intrinsics?
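
One thing worth checking before comparing timings: the two loops do not compute the same result. The scalar loop multiplies three consecutive Laplace values by a single Gauss weight, whereas the intrinsics loop multiplies Laplace elements 0..3, 4..7 and 8..11 each by Gauss elements 0..3. A minimal sketch of a vector loop that matches the scalar version, assuming the interleaved three-values-per-pixel layout, could use vld3q_f32 and vst3q_f32 to de-interleave and re-interleave the triples. The name ScaleTriples and the scalar fallback are illustrative and not part of the original post.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: multiplies laplace[3*i .. 3*i+2] by gauss[i]
// for every i in [0, numPixels), like the scalar loop in the question.
void ScaleTriples(const float* gauss, float* laplace, std::size_t numPixels)
{
    assert(gauss != nullptr && laplace != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Process four pixels (12 Laplace values) per iteration.
    for (; i + 4 <= numPixels; i += 4)
    {
        float32x4_t weights = vld1q_f32(gauss + i);
        // De-interleave: val[c] holds value c of four consecutive pixels.
        float32x4x3_t px = vld3q_f32(laplace + 3 * i);
        px.val[0] = vmulq_f32(px.val[0], weights);
        px.val[1] = vmulq_f32(px.val[1], weights);
        px.val[2] = vmulq_f32(px.val[2], weights);
        // Re-interleave and write back in place.
        vst3q_f32(laplace + 3 * i, px);
    }
#endif
    // Scalar tail; also the full fallback when NEON is unavailable.
    for (; i < numPixels; ++i)
    {
        laplace[3 * i]     *= gauss[i];
        laplace[3 * i + 1] *= gauss[i];
        laplace[3 * i + 2] *= gauss[i];
    }
}
```

Whether this is faster than the auto-vectorized scalar loop would still need to be measured; the point of the sketch is only that both versions then do the same work.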

1 answer:

Answer 0: (score: 0)

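Your code contains a relatively large number of vector load operations and only a few multiplications, so I would recommend optimizing the vector loads. There are two steps: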

  • Use aligned memory for your arrays.
  • Use prefetching.

To do this, I would recommend using the following function:

#include <arm_neon.h>

inline float32x4_t Load(const float * p)
{
    // Prefetch 256 floats (1 KiB) ahead of the current position:
    __builtin_prefetch(p + 256);
    // Tell the compiler that the address is 16-byte aligned:
    const float * _p = (const float *)__builtin_assume_aligned(p, 16);
    return vld1q_f32(_p);
}
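
The alignment hint is only valid if the buffers really are 16-byte aligned, so the first step above needs an allocator that guarantees it. Here is a minimal sketch, assuming a POSIX platform where posix_memalign is available. The helper names AllocAlignedFloats and IsAligned16 are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free (POSIX)

// Hypothetical helper: returns a 16-byte-aligned buffer of `count` floats,
// or nullptr on failure. The caller releases it with free().
float* AllocAlignedFloats(std::size_t count)
{
    void* raw = nullptr;
    if (posix_memalign(&raw, 16, count * sizeof(float)) != 0)
    {
        return nullptr;
    }
    return static_cast<float*>(raw);
}

// Hypothetical helper: checks the alignment that Load() promises
// to the compiler through __builtin_assume_aligned(p, 16).
bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

A debug build could call assert(IsAligned16(ptr)) before the loop. Because the loop advances by multiples of four floats (16 bytes), every later address passed to Load stays aligned as long as the base pointer is.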