What's a fast way to space out the bits within a word?

Time: 2017-12-07 10:28:27

Tags: performance bitwise-operators

I have a 32-bit value in the lower half of a 64-bit register; the top half is 0 (X denotes a bit carrying information, bits listed from LSB to MSB):

X X X  ...  X 0 0 0 0 ... 0

Now, I want to "space out" the bits carrying information, so that I have

X 0 X 0 X 0 ... X 0

(or, if you'd rather put the 0s first, then

0 X 0 X 0 X 0 ... X

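is fine too.)

What's a fast way to do that?

An answer relevant to multiple CPU architectures would be nice, but something specific to Intel x86_64 and/or nVIDIA Pascal SMs would be the most relevant.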

1 answer:

Answer 0: (score: 3)

This is called a Morton number. It is a specific case of parallel expand, which in turn is the reverse of the compress right operation.

One general solution might be

uint64_t bit_expand(uint64_t x) // 00000000ABCDEFGH
{
    x = ((x & 0xFFFF0000) << 32) | ((x & 0x0000FFFF) << 16);        // ABCD0000EFGH0000
    x = (x & 0xFF000000FF000000) | ((x & 0x00FF000000FF0000) >> 8); // AB00CD00EF00GH00
    x = (x & 0xF000F000F000F000) | ((x & 0x0F000F000F000F00) >> 4); // A0B0C0D0E0F0G0H0
    x = (x & 0xC0C0C0C0C0C0C0C0) | ((x & 0x3030303030303030) >> 2); // bits in each byte: ab00cd00
    x = (x & 0x8888888888888888) | ((x & 0x4444444444444444) >> 1); // bits in each byte: a0b0c0d0
    return x;
}
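When comparing the faster variants against each other, a slow bit-by-bit reference is handy. The following is only a minimal sketch: the name space_out_ref is hypothetical and not part of the original answer, and it assumes the "zeros first" layout, where input bit i lands on output bit 2*i + 1.

```c
#include <stdint.h>

/* Hypothetical reference helper (not from the original answer):
   moves bit i of the 32-bit input to bit 2*i + 1 of the result,
   leaving every even bit position zero. */
uint64_t space_out_ref(uint32_t x)
{
    uint64_t out = 0;
    for (int i = 0; i < 32; ++i) {
        if ((x >> i) & 1u)
            out |= (uint64_t)1 << (2 * i + 1);
    }
    return out;
}
```

For the other layout (information bits in the even positions), shifting the result right by one bit should be enough.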

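However, generating the constants might be inefficient on RISC architectures, because a 64-bit immediate can't be encoded in a single instruction the way it can on x86. Even on x86 the output assembly is quite long. Here is another possible implementation, as described on Bit Twiddling Hacks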
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};
static const unsigned int S[] = {1, 2, 4, 8};

unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x
unsigned int y; // are in the even positions and bits from y in the odd;
unsigned int z; // z gets the resulting 32-bit Morton Number.  
                // x and y must initially be less than 65536.

x = (x | (x << S[3])) & B[3];
x = (x | (x << S[2])) & B[2];
x = (x | (x << S[1])) & B[1];
x = (x | (x << S[0])) & B[0];

y = (y | (y << S[3])) & B[3];
y = (y | (y << S[2])) & B[2];
y = (y | (y << S[1])) & B[1];
y = (y | (y << S[0])) & B[0];

z = x | (y << 1);
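The snippet above interleaves two 16-bit values. For the case in the question (one 32-bit value spread over 64 bits), the same shift-and-mask idea can be widened by one step. This is a sketch under assumptions: the function name spread_bits32 is hypothetical, and the layout chosen here puts input bit i on output bit 2*i.

```c
#include <stdint.h>

/* Hypothetical 64-bit adaptation of the shift-and-mask approach:
   spreads the low 32 bits of x so that input bit i lands on output bit 2*i. */
uint64_t spread_bits32(uint64_t x)
{
    x &= 0x00000000FFFFFFFFULL;                   /* ignore the upper half */
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;  /* split into 16-bit groups */
    x = (x | (x <<  8)) & 0x00FF00FF00FF00FFULL;  /* 8-bit groups */
    x = (x | (x <<  4)) & 0x0F0F0F0F0F0F0F0FULL;  /* 4-bit groups */
    x = (x | (x <<  2)) & 0x3333333333333333ULL;  /* 2-bit groups */
    x = (x | (x <<  1)) & 0x5555555555555555ULL;  /* single bits */
    return x;
}
```

Shifting the result left by one gives the layout with the zeros in the even positions instead.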

A lookup table can also be used

// abcd -> 0a0b0c0d: input bit i moves to output bit 2*i
#define EXPAND4(a) ((((a) & 0x8) << 3) | (((a) & 0x4) << 2) \
                  | (((a) & 0x2) << 1) | (((a) & 0x1)))

const uint8_t LUT[16] = {
    EXPAND4( 0), EXPAND4( 1), EXPAND4( 2), EXPAND4( 3),
    EXPAND4( 4), EXPAND4( 5), EXPAND4( 6), EXPAND4( 7),
    EXPAND4( 8), EXPAND4( 9), EXPAND4(10), EXPAND4(11),
    EXPAND4(12), EXPAND4(13), EXPAND4(14), EXPAND4(15)
};

output = ((uint64_t)LUT[(x >> 28) & 0xF] << 56) | ((uint64_t)LUT[(x >> 24) & 0xF] << 48)
       | ((uint64_t)LUT[(x >> 20) & 0xF] << 40) | ((uint64_t)LUT[(x >> 16) & 0xF] << 32)
       | ((uint64_t)LUT[(x >> 12) & 0xF] << 24) | ((uint64_t)LUT[(x >>  8) & 0xF] << 16)
       | ((uint64_t)LUT[(x >>  4) & 0xF] <<  8) | ((uint64_t)LUT[(x >>  0) & 0xF] <<  0);

The size of the lookup table can be increased if necessary
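As an illustration of a larger table, here is a sketch with 256 entries, so that one lookup handles a whole input byte. The names lut8, init_lut8 and space_out_lut8 are hypothetical, and filling the table at run time is just one option; it could equally be generated as a constant array.

```c
#include <stdint.h>

static uint16_t lut8[256];   /* hypothetical table: one 16-bit entry per input byte */

/* Fills lut8 so that bit i of the index lands on bit 2*i of the entry. */
void init_lut8(void)
{
    for (unsigned v = 0; v < 256; ++v) {
        uint16_t e = 0;
        for (unsigned i = 0; i < 8; ++i)
            e |= (uint16_t)(((v >> i) & 1u) << (2 * i));
        lut8[v] = e;
    }
}

/* Spreads a 32-bit value over 64 bits with four table lookups.
   init_lut8() must have been called first. */
uint64_t space_out_lut8(uint32_t x)
{
    return  (uint64_t)lut8[x & 0xFF]
         | ((uint64_t)lut8[(x >>  8) & 0xFF] << 16)
         | ((uint64_t)lut8[(x >> 16) & 0xFF] << 32)
         | ((uint64_t)lut8[(x >> 24) & 0xFF] << 48);
}
```

The table takes 512 bytes, which is a trade-off against the number of lookups: fewer memory accesses per value, but more cache footprint than the 16-entry version.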

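On x86 with BMI2 there is hardware support for this in the form of the PDEP instruction, which can be accessed through the following intrinsic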

output = _pdep_u64(x, 0xaaaaaaaaaaaaaaaaULL);
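To make it clearer what the instruction computes, here is a sketch of a portable loop with the same semantics, used as a fallback when the code is not compiled with BMI2 enabled. The names pdep_soft and space_out_pdep are hypothetical; only _pdep_u64 (from immintrin.h) is the real intrinsic, and the __BMI2__ check assumes a GCC- or Clang-style compiler that defines that macro under -mbmi2.

```c
#include <stdint.h>
#ifdef __BMI2__
#include <immintrin.h>   /* _pdep_u64 */
#endif

/* Hypothetical portable stand-in that mirrors what PDEP computes:
   successive low bits of src are deposited at the set-bit positions
   of mask, starting from the lowest set bit. */
uint64_t pdep_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that bit of mask */
    }
    return result;
}

uint64_t space_out_pdep(uint32_t x)
{
#ifdef __BMI2__
    return _pdep_u64(x, 0xAAAAAAAAAAAAAAAAULL);   /* hardware path */
#else
    return pdep_soft(x, 0xAAAAAAAAAAAAAAAAULL);   /* software fallback */
#endif
}
```

With the mask 0xAAAA..., input bit i ends up on output bit 2*i + 1; using 0x5555... instead would select the even positions.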

Another solution, for architectures that lack a bit deposit/expand instruction but have a fast multiplier

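#include <stdint.h>
#include <endian.h>   // htobe64 (glibc); acts as a byte swap on little-endian hosts

uint64_t spaceOut8bits(uint8_t b)
{
    uint64_t MAGIC = 0x8040201008040201;
    uint64_t MASK  = 0x8080808080808080;
    // keep each input bit in the MSB of its own byte, then reverse the byte order
    uint64_t expand8bits = htobe64((MAGIC*b) & MASK);
    // gather the bits into two groups of four (top byte and byte 3)
    uint64_t spacedOutBits = expand8bits*0x41041 & 0xAA000000AA000000;
    // merge the two groups into the top 16 bits
    return (spacedOutBits | (spacedOutBits << 24)) & 0xFFFF000000000000;
}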

uint64_t spaceOut64bits(uint64_t x)
{
    return (spaceOut8bits(x >> 24) >>  0)
         | (spaceOut8bits(x >> 16) >> 16)
         | (spaceOut8bits(x >>  8) >> 32)
         | (spaceOut8bits(x >>  0) >> 48);
}

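It works like this

  • The first step expands the input bits from abcdefgh to a0000000 b0000000 c0000000 d0000000 e0000000 f0000000 g0000000 h0000000 and stores the result in expand8bits
  • In the next step we move those spaced-out bits closer together by multiplying and masking. After that, spacedOutBits contains a0b0c0d0 00000000 00000000 00000000 e0f0g0h0 00000000 00000000 00000000. The final step merges those two bytes into the top 16 bits of the result

The magic number that brings the bits closer together is worked out like this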

  a0000000b0000000c0000000d0000000e0000000f0000000g0000000h0000000
×                                              1000001000001000001
  ────────────────────────────────────────────────────────────────
  a0000000b0000000c0000000d0000000e0000000f0000000g0000000h0000000
  00b0000000c0000000d0000000e0000000f0000000g0000000h0000000
+ 0000c0000000d0000000e0000000f0000000g0000000h0000000
  000000d0000000e0000000f0000000g0000000h0000000
  ────────────────────────────────────────────────────────────────
  a0b0c0d0b0c0d0e0c0d0e0f0d0e0f0g0e0f0g0h0f0g0h000g0h00000h0000000
& 1010101000000000000000000000000010101010000000000000000000000000
  ────────────────────────────────────────────────────────────────
  a0b0c0d0000000000000000000000000e0f0g0h0000000000000000000000000
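The multiplication above can be checked in isolation. The sketch below reproduces only the multiply-and-mask step; gather_step is a hypothetical helper, and it builds the spread-out operand with a plain loop rather than with the first multiplication, purely to keep the example independent of byte order.

```c
#include <stdint.h>

/* Hypothetical helper that reproduces only the multiply-and-mask step.
   Bit 7 of b plays the role of "a" and bit 0 the role of "h". */
uint64_t gather_step(uint8_t b)
{
    uint64_t spread = 0;

    /* Put input bit j into the most significant bit of byte j,
       giving a0000000 b0000000 ... h0000000 from MSB to LSB. */
    for (int j = 0; j < 8; ++j) {
        if ((b >> j) & 1u)
            spread |= (uint64_t)1 << (8 * j + 7);
    }

    /* 0x41041 has bits 0, 6, 12 and 18 set, so the product is the sum of
       four shifted copies; no two copies set the same bit, so no carries occur. */
    return (spread * 0x41041ULL) & 0xAA000000AA000000ULL;
}
```

For example, an input with only the lowest bit set ("h") should come out as a single bit at position 25, the last slot of the e0f0g0h0 byte.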

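The output assembly can be seen here. You can change the compiler to see how it is done on various architectures

There is also another approach on the Bit Twiddling Hacks page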

z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) * 
     0x0102040810204081ULL >> 49) & 0x5555 |
    ((y * 0x0101010101010101ULL & 0x8040201008040201ULL) * 
     0x0102040810204081ULL >> 48) & 0xAAAA;

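More solutions can be found in Portable efficient alternative to PDEP without using BMI2?

Related: How to do bit striping on pixel data?

As you can see, without a bit deposit instruction this is quite complex in terms of operations. If you need to do this kind of striping on many values rather than just once, it is better to do it in parallel with SIMD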