Question

In How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?, the OP asks for the inverse of _mm256_movemask_epi8. Is there a simpler version for SSE's _mm_movemask_ps()? This is the best I could come up with, and it's not too bad.

__m128 movemask_inverse(int x) {
    __m128 m = _mm_setr_ps(x & 1, x & 2, x & 4, x & 8);
    return _mm_cmpneq_ps(m, _mm_setzero_ps());
}

Answer 1

The efficiency of the inverse movemask depends strongly on the compiler. With gcc it takes about 21 instructions.

However, with clang -std=c99 -O3 -m64 -Wall -march=nehalem the code vectorizes well, and the result is actually not too bad:

movemask_inverse_original:              # @movemask_inverse_original
        movd    xmm0, edi
        pshufd  xmm0, xmm0, 0           # xmm0 = xmm0[0,0,0,0]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        cvtdq2ps        xmm1, xmm0
        xorps   xmm0, xmm0
        cmpneqps        xmm0, xmm1
        ret

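Nevertheless, you don't need the cvtdq2ps integer-to-float conversion. It is more efficient to compute the mask in the integer domain and then cast (not convert) the result to float. Peter Cordes' answer to is there an inverse instruction to the movemask instruction in intel avx2? discusses many ideas for the AVX2 case. Most of these ideas can also be used in some form for the SSE case. The LUT solution and the ALU solution are suitable for your case.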
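For illustration, the LUT idea could look roughly like the sketch below. It is not taken from the linked answer: the table name movemask_lut, the LANE/ROW macros, and the function name movemask_inverse_lut are all hypothetical, and an unaligned load is used so that the table needs no alignment attribute.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE _mm_movemask_ps */
#include <stdint.h>

/* Hypothetical helpers: lane k of row i is all-ones when bit k of i is set. */
#define LANE(i, k) ((((i) >> (k)) & 1) ? 0xFFFFFFFFu : 0u)
#define ROW(i)     { LANE(i, 0), LANE(i, 1), LANE(i, 2), LANE(i, 3) }

/* 16 rows of 4 lanes = 256 bytes, one row per possible 4-bit mask. */
static const uint32_t movemask_lut[16][4] = {
    ROW(0),  ROW(1),  ROW(2),  ROW(3),
    ROW(4),  ROW(5),  ROW(6),  ROW(7),
    ROW(8),  ROW(9),  ROW(10), ROW(11),
    ROW(12), ROW(13), ROW(14), ROW(15)
};

/* Assumed sketch of a LUT-based inverse: one indexed load, then a bit cast. */
static inline __m128 movemask_inverse_lut(int x) {
    const __m128i *row = (const __m128i *)movemask_lut[x & 15];
    return _mm_castsi128_ps(_mm_loadu_si128(row));
}
```

Whether this beats the ALU version depends on the surrounding code: the table costs a memory access and cache footprint, while the ALU version only needs one constant that can stay in a register.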

ALU solution with intrinsics:

__m128 movemask_inverse_alternative(int x) {
    __m128i msk8421 = _mm_set_epi32(8, 4, 2, 1);   /* element k holds 1 << k */
    __m128i x_bc = _mm_set1_epi32(x);              /* broadcast x to all elements */
    __m128i t = _mm_and_si128(x_bc, msk8421);      /* isolate bit k in element k */
    return _mm_castsi128_ps(_mm_cmpeq_epi32(msk8421, t));  /* all-ones where bit k is set */
}

Assembly generated with gcc -std=c99 -O3 -m64 -Wall -march=nehalem:

movemask_inverse_alternative:
        movdqa  xmm1, XMMWORD PTR .LC0[rip]  % load constant 8, 4, 2, 1
        movd    xmm2, edi                    % move x from gpr register to xmm register
        pshufd  xmm0, xmm2, 0                % broadcast element 0 to element 3, 2, 1, 0
        pand    xmm0, xmm1                   % and with 8, 4, 2, 1
        pcmpeqd xmm0, xmm1                   % compare with 8, 4, 2, 1
        ret

After the function is inlined, the movdqa is likely to be hoisted out of the loop.
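A quick way to sanity-check any such inverse is to round-trip all 16 possible masks through _mm_movemask_ps. The sketch below is an assumption-laden illustration, not part of the original answer: movemask_inverse_alu is a renamed copy of the ALU approach, and roundtrip_mismatches is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics; also provides _mm_movemask_ps */

/* ALU-style inverse, renamed for illustration (hypothetical name). */
static inline __m128 movemask_inverse_alu(int x) {
    __m128i bits = _mm_set_epi32(8, 4, 2, 1);             /* element k holds 1 << k */
    __m128i t    = _mm_and_si128(_mm_set1_epi32(x), bits); /* keep bit k in element k */
    return _mm_castsi128_ps(_mm_cmpeq_epi32(t, bits));     /* all-ones where bit k is set */
}

/* Hypothetical helper: counts how many of the 16 masks fail to round-trip. */
static int roundtrip_mismatches(__m128 (*inverse)(int)) {
    int bad = 0;
    for (int x = 0; x < 16; ++x)
        bad += (_mm_movemask_ps(inverse(x)) != x);
    return bad;
}
```

Calling roundtrip_mismatches(movemask_inverse_alu) should return 0 for a correct inverse. Note that comparing the masked value against the broadcast x instead of the bit constant would fail this check for any x with more than one bit set.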
