Question

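I need to shift a __m128i variable (say v) left by m bits, so that bits carry across the whole variable (that is, the result represents v * 2^m). What is the best way to do this?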

Note that _mm_slli_epi64 shifts v0 and v1 separately:

r0 := v0 << count
r1 := v1 << count

So the top bits of v0 are lost, but I want those bits to move into r1.
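
To make the problem concrete, here is a small sketch of the per-lane behavior. The helper names `lane_lo`, `lane_hi`, and `shift_lanes_by_one` are made up for illustration, and the code assumes an x86-64 target with SSE2:

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Hypothetical helpers: read the low / high 64-bit lane of a vector.
static inline uint64_t lane_lo(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(v);
}

static inline uint64_t lane_hi(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
}

// Shifts each 64-bit lane left by one bit, independently of the other lane.
static inline __m128i shift_lanes_by_one(__m128i v)
{
    return _mm_slli_epi64(v, 1);
}
```

With v0 = 0x8000000000000000 and v1 = 0, both lanes of the result are zero, whereas a true 128-bit shift by one would leave r1 = 1.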

Edit: I'm looking for code that is faster than this (for m < 64):

r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
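
For reference, the scalar code above could be written as a compilable C function along these lines. This is only a sketch: the `u128_halves` struct and the function name are hypothetical, the overflow word r2 is dropped, and m is assumed to satisfy 0 < m < 64.

```c
#include <stdint.h>

// Hypothetical container for the two 64-bit halves (lo = v0, hi = v1).
typedef struct { uint64_t lo, hi; } u128_halves;

// Shift the 128-bit value {hi:lo} left by m bits; requires 0 < m < 64.
// (m == 0 would turn the >> (64 - m) into a shift by 64, which is undefined in C.)
static inline u128_halves shl128_halves(u128_halves v, unsigned m)
{
    u128_halves r;
    r.lo = v.lo << m;
    r.hi = (v.hi << m) | (v.lo >> (64 - m));  // bits leaving lo enter hi
    return r;
}
```

For example, shifting {hi = 0x1, lo = 0x8000000000000001} by 4 moves the top bit of lo into bit 3 of hi.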

Answer 1

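For compile-time-constant shift counts, you can get fairly good results. Otherwise, not really.

This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit shifts within vector elements, not for byte shifts of the whole register. So we just carry the low 64 bits up to the high 64 bits and use a variable-count shift to put them in the right place.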

// untested
#include <immintrin.h>

/* some compilers might choke on slli / srli with non-compile-time-constant args
 * gcc generates the   xmm, imm8 form with constants,
 * and generates the   xmm, xmm  form otherwise.  (With movd to get the count in an xmm)
 */

// doesn't optimize for the special case where count%8 == 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (count%8 == 0) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);   // old compilers only have the confusingly named _mm_slli_si128 synonym
    if (count >= 64)
        return _mm_slli_epi64(carry, count-64);  // the non-carry part is all zero, so return early
    // else
    carry = _mm_srli_epi64(carry, 64-count);  // After bslli shifted left by 64b

    x = _mm_slli_epi64(x, count);
    return _mm_or_si128(x, carry);
}

__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
    return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100);  }

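I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc / clang / icc even when the count is not a compile-time constant (generating a movd from an integer register to an xmm register). There is also _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.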
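
If you would rather spell out the register-count form yourself, a minimal sketch looks like this. The wrapper name `shift_lanes_var` is made up for illustration; the two intrinsics are standard SSE2.

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Shift both 64-bit lanes left by a run-time count using the
// register-count form of psllq; counts above 63 give all-zero lanes.
static inline __m128i shift_lanes_var(__m128i v, unsigned count)
{
    // movd: count goes into the low 32 bits, the rest of the register is zeroed
    __m128i vcount = _mm_cvtsi32_si128((int)count);
    return _mm_sll_epi64(v, vcount);
}
```

This is still a per-lane shift, so it does not carry bits from the low lane into the high one.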

The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):

mm_bitshift_left_3(long long __vector(2)):
        vpslldq xmm1, xmm0, 8
        vpsrlq  xmm1, xmm1, 61
        vpsllq  xmm0, xmm0, 3
        vpor    xmm0, xmm0, xmm1
        ret

Performance:

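This has 3-cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB / IvB / Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). The byte shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uops with 2-cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)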

Or for counts >= 64:

mm_bitshift_left_100(long long __vector(2)):
        vpslldq xmm0, xmm0, 8
        vpsllq  xmm0, xmm0, 36
        ret

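If your shift count is not a compile-time constant, you have to branch on count >= 64 to figure out whether to left-shift or right-shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.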

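It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.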
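
As a rough illustration of a branchless variant, here is a sketch that uses a different trick from the compare-and-blend idea: it relies on psllq / psrlq returning zero for any count above 63, so the term that does not apply zeroes itself. The function name is hypothetical, the code has not been benchmarked against the branchy version, and the unsigned-to-int casts assume the usual wrapping conversion that gcc and clang perform.

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Branchless 128-bit left shift. Each count is zero-extended into a vector
// register, so a wrapped "negative" count becomes a huge one and that term
// comes out as zero.
static inline __m128i mm_bitshift_left_branchless(__m128i x, unsigned count)
{
    __m128i carry = _mm_slli_si128(x, 8);  // low qword moved up, low qword zeroed

    __m128i c_left   = _mm_cvtsi32_si128((int)count);
    __m128i c_right  = _mm_cvtsi32_si128((int)(64u - count));
    __m128i c_left64 = _mm_cvtsi32_si128((int)(count - 64u));

    __m128i main_part  = _mm_sll_epi64(x, c_left);        // nonzero only for count < 64
    __m128i carry_low  = _mm_srl_epi64(carry, c_right);   // nonzero only for 1 <= count <= 64
    __m128i carry_high = _mm_sll_epi64(carry, c_left64);  // nonzero only for 64 <= count < 128

    return _mm_or_si128(main_part, _mm_or_si128(carry_low, carry_high));
}
```

At count == 64 both carry terms equal the carry itself, so OR-ing them is harmless. The cost is three variable-count shifts and three movd instructions, so whether it beats a well-predicted branch would need measuring.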

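The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)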

__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
    return x << count;  // undefined if count >= 128
}
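
If counts of 128 or more can actually occur, one option is a small wrapper that defines the result as zero instead of leaving it undefined. This is a sketch only: the name `leftshift_int128_sat` is made up, and `__uint128_t` is a GCC / Clang extension available on 64-bit targets.

```c
// Hypothetical wrapper: returns 0 for count >= 128 instead of
// invoking undefined behavior in the shift.
static inline __uint128_t leftshift_int128_sat(__uint128_t x, unsigned count)
{
    return count < 128 ? x << count : 0;
}
```

The extra comparison adds a little work to the generated code, so it is only worth having when out-of-range counts are a real possibility.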

Answer 2

In SSE4a, the instructions extrq and insertq can be used to shift (and rotate) through a __m128i 1 to 64 bits at a time. Unlike the 8/16/32/64-bit counterparts pextrN / pinsrX, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of the length and the offset must not exceed 128.

What is the best way to shift a __m128i?
