Question

After performing a Sobel operation, I have the following code:

short* tempBufferVert = new short[width * height];
ippiFilterSobelVertBorder_8u16s_C1R(pImg, width, tempBufferVert, width * 2, dstSize, IppiMaskSize::ippMskSize3x3, IppiBorderType::ippBorderConst, 0, pBufferVert);
for (int i = 0; i < width * height; i++)
    tempBufferVert[i] >>= 2;

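Frustratingly, the bit shift is the slowest step: the IPP Sobel is so well optimized that it runs faster than my naive shift loop. How can I optimize the bit shift, or is there an IPP function or some other option (AVX?) that performs a bit shift over a whole block of memory? It has to preserve the sign of the short values, which is what >>= does in the Visual Studio implementation.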

Answer 1

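First make sure you are compiling with optimization enabled (e.g. -O3, or /O2 in Visual Studio), then check whether your compiler auto-vectorizes the right-shift loop. If it doesn't, you can probably get a significant improvement with SSE: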

#include <emmintrin.h> // SSE2

for (int i = 0; i < width * height; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // v >>= 2
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}

(Note: this assumes width*height is a multiple of 8.)

You can probably do better with some loop unrolling and/or by using AVX2, but this may already be good enough for your needs.
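As a rough sketch of those two points (the multiple-of-8 assumption and AVX2), the loop can be wrapped in a function that uses the widest instruction set enabled at compile time and finishes any leftover elements with a scalar loop. The function name shift_right_2 is made up for illustration, and the preprocessor checks assume the usual compiler macros (__AVX2__, __SSE2__, _M_X64); treat it as a starting point to benchmark, not a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>   // AVX2
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // SSE2
#endif

// Hypothetical helper: arithmetic right shift by 2 of n 16-bit values, in place.
void shift_right_2(short* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    std::size_t i = 0;

#if defined(__AVX2__)
    // 16 shorts per iteration, unaligned loads/stores.
    for (; i + 16 <= n; i += 16)
    {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        v = _mm256_srai_epi16(v, 2);   // sign-preserving shift
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(data + i), v);
    }
#elif defined(__SSE2__) || defined(_M_X64)
    // 8 shorts per iteration, unaligned loads/stores.
    for (; i + 8 <= n; i += 8)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        v = _mm_srai_epi16(v, 2);      // sign-preserving shift
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i), v);
    }
#endif

    // Scalar tail (or the whole array when no SIMD path is compiled in).
    // Assumes >> on a negative value is an arithmetic shift, which holds on
    // mainstream compilers and is guaranteed from C++20 on.
    for (; i < n; ++i)
        data[i] >>= 2;
}
```

With this shape the call site becomes shift_right_2(tempBufferVert, width * height), and the buffer size no longer has to be a multiple of the vector width.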

Answer 2

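C++ optimizers tend to do better with iterator-based loops than with index-based loops.

This is because the compiler can make assumptions about how address arithmetic behaves when an index overflows. For it to make the same assumptions when you index into an array, you have to happen to pick the right data type for the index.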
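To make the index-type point concrete, here is a small sketch; the function names are invented for illustration and are not from the original answer. A 32-bit unsigned index wraps modulo 2^32 by definition, so on a 64-bit target the compiler may have to preserve that behavior when forming addresses. A std::size_t index matches pointer width, and a pointer range involves no index at all. In a loop this simple most compilers will probably cope with all three, so inspecting the generated assembly is the only reliable check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper names, for illustration only.

// 32-bit unsigned index: overflow is defined to wrap around.
void shift_unsigned_index(short* data, unsigned n, int bits)
{
    assert(bits >= 0 && bits < 16);
    for (unsigned i = 0; i < n; ++i)
        data[i] >>= bits;
}

// Pointer-width index: no narrower-than-pointer wraparound to model.
void shift_size_t_index(short* data, std::size_t n, int bits)
{
    assert(bits >= 0 && bits < 16);
    for (std::size_t i = 0; i < n; ++i)
        data[i] >>= bits;
}

// Pointer ("iterator") range: the loop only compares and increments addresses.
void shift_pointer_range(short* first, short* last, int bits)
{
    assert(bits >= 0 && bits < 16);
    for (; first != last; ++first)
        *first >>= bits;
}
```

All three compute the same result; they differ only in what the optimizer is allowed to assume about the loop counter.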

The shift code can be expressed as:

void shift(short* first, short* last, int bits)
{
  while (first != last) {
    *first++ >>= bits;
  }
}

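int test(int width, int height)
{
  // In the real code this buffer is filled by the Sobel filter first.
  short* tempBufferVert = new short[width * height];
  shift(tempBufferVert, tempBufferVert + (width * height), 2);

  delete[] tempBufferVert;  // release the buffer
  return 0;                 // a non-void function must return a value
}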

With the right optimizations enabled, this gets vectorized: https://godbolt.org/g/oJ8Boj

Note the middle part of the loop:

.L76:
        vmovdqa ymm0, YMMWORD PTR [r9+rdx]
        add     r8, 1
        vpsraw  ymm0, ymm0, 2
        vmovdqa YMMWORD PTR [r9+rdx], ymm0
        add     rdx, 32
        cmp     rsi, r8
        ja      .L76
        lea     rax, [rax+rdi*2]
        cmp     rcx, rdi
        je      .L127
        vzeroupper

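This shifts the whole block of memory efficiently: each iteration loads 32 bytes (16 shorts), shifts them with vpsraw, and stores them back.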
