Question

I'm wondering how to speed up the bit testing in the following routine:

void histSubtractFromBits(uint64* cursor, uint16* hist){
    //traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
    std::bitset<64> a(*cursor);
    std::bitset<64> b(*(cursor+1));
    std::bitset<64> c(*(cursor+2));
    std::bitset<64> d(*(cursor+3));
    for(int bit = 0; bit < 64; bit++){
        hist[bit] -= a.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+64] -= b.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+128] -= c.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+192] -= d.test(bit);
    }
}

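The actual gcc implementation does a range check on the bit argument and then &-s with a bitmask. I could do it without the bitsets, using my own bit shifting and masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong, and why).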

I'm not really familiar with x86-64 assembly, but I am aware of a certain bit test instruction, and I know that it's theoretically possible to use inline assembly with gcc.

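1) Do you think it is worthwhile to write an inline-assembly analogue of the code above?

2) If yes, how would I go about it? That is, could you show me some basic starter code or samples to point me in the right direction?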

Answer 1

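As far as I can tell, you basically iterate over each bit. As such, I'd imagine that simply shifting and masking off the LSB each time should give good performance. Something like: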

uint64_t a = *cursor;
for(int bit = 0; a != 0; bit++, a >>= 1) {
    hist[bit] -= (a & 1);
}
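For completeness, here is a minimal sketch of how that loop might be extended to all four 64-bit words of the 256-bit string. The name `histSubtractShiftMask` is hypothetical, and the code assumes `cursor` points at four words and `hist` at 256 counters, as in the question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name: subtracts every set bit of a 256-bit string from hist.
// Assumes cursor points at 4 words and hist at 256 counters.
void histSubtractShiftMask(const std::uint64_t* cursor, std::uint16_t* hist) {
    assert(cursor != nullptr && hist != nullptr);
    for (int word = 0; word < 4; ++word) {
        std::uint64_t a = cursor[word];
        std::uint16_t* h = hist + 64 * word;
        // The loop stops as soon as no set bits remain in this word.
        for (int bit = 0; a != 0; ++bit, a >>= 1) {
            h[bit] -= static_cast<std::uint16_t>(a & 1);
        }
    }
}
```

Because the inner loop exits once the word is zero, sparse inputs finish early, while dense inputs still take up to 64 iterations per word.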

Or, if you expect only very few bits to be set and are happy with gcc-specific stuff, you could use __builtin_ffsll:

uint64_t a = *cursor;
int next;
for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
    bit += next;
    hist[bit - 1] -= 1;
    a = (a >> (next - 1)) >> 1; // shift in two steps: a >>= 64 is undefined when next == 64
}

The idea should be fine, but no guarantees for the actual code :)

Update: code using the vector extensions:
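A variant of the same sparse-bit idea avoids the running shift altogether: find the lowest set bit with `__builtin_ctzll` (a GCC/Clang builtin) and clear it with `a &= a - 1`. This is only a sketch with a hypothetical function name, not a measured improvement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: visits only the set bits of one 64-bit word.
// __builtin_ctzll is undefined for 0, hence the a != 0 loop condition.
void histSubtractSparse(std::uint64_t a, std::uint16_t* hist) {
    assert(hist != nullptr);
    while (a != 0) {
        int bit = __builtin_ctzll(a);  // index of the lowest set bit
        hist[bit] -= 1;
        a &= a - 1;                    // clear that bit
    }
}
```

The loop runs once per set bit, so it only pays off when the words are sparse.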

typedef short v8hi __attribute__ ((vector_size (16)));

static v8hi table[256];

void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
{
    uint8_t* cursor_tmp = (uint8_t*)cursor;
    v8hi* hist_tmp = (v8hi*)hist;
    for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
    {
        *hist_tmp -= table[*cursor_tmp];
    }
}

void setup_table()
{
    for(int i = 0; i < 256; i++)
    {
        for(int j = 0; j < 8; j++)
        {
            table[i][j] = (i >> j) & 1;
        }
    }
}

This gets compiled to SSE instructions where possible; for example, I get:

        leaq    32(%rdi), %rdx
        .p2align 4,,10
        .p2align 3
.L2:
        movzbl  (%rdi), %eax
        addq    $1, %rdi
        movdqa  (%rsi), %xmm0
        salq    $4, %rax
        psubw   table(%rax), %xmm0
        movdqa  %xmm0, (%rsi)
        addq    $16, %rsi
        cmpq    %rdx, %rdi
        jne     .L2

Of course, this approach relies on the table being in cache.
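To make the table idea easier to check without vector extensions, here is a portable sketch using plain arrays; whether the compiler vectorizes the inner loop is not guaranteed. The names `Row`, `makeBitTable`, and `histSubtractTable` are hypothetical. Bytes are extracted by shifting rather than through a `uint8_t*` cast, so the sketch does not depend on endianness.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Row = std::array<std::uint16_t, 8>;

// Builds the 256-entry table: row i holds the 8 bits of byte value i.
std::array<Row, 256> makeBitTable() {
    std::array<Row, 256> table{};
    for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 8; ++j)
            table[i][j] = static_cast<std::uint16_t>((i >> j) & 1);
    return table;
}

// Subtracts 8 histogram entries per input byte via one table lookup.
void histSubtractTable(const std::uint64_t* cursor, std::uint16_t* hist) {
    static const std::array<Row, 256> table = makeBitTable();
    assert(cursor != nullptr && hist != nullptr);
    for (int byte = 0; byte < 32; ++byte) {
        // Byte k of the 256-bit string lives in word k / 8 at bit offset 8 * (k % 8).
        unsigned value = static_cast<unsigned>(
            (cursor[byte / 8] >> (8 * (byte % 8))) & 0xFF);
        const Row& row = table[value];
        for (int j = 0; j < 8; ++j)
            hist[8 * byte + j] -= row[j];
    }
}
```

The table occupies 256 × 16 bytes = 4 KiB, which is the part that has to stay in cache for the lookup to be cheap.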

Answer 2

Another suggestion is to combine data caching, registers, and loop unrolling:

#include <climits> // CHAR_BIT

// Assuming your processor has 64-bit words
void histSubtractFromBits(uint64_t const * cursor, uint16* hist)
{
    register uint64_t a = *cursor++;
    register uint64_t b = *cursor++;
    register uint64_t c = *cursor++;
    register uint64_t d = *cursor++;
    register unsigned int i = 0;
    for (i = 0; i < sizeof(*cursor) * CHAR_BIT; ++i)
    {
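        // Subtract, as in the question's routine.
        hist[i +   0] -= a & 1;
        hist[i +  64] -= b & 1;
        hist[i + 128] -= c & 1;
        hist[i + 192] -= d & 1;
        a >>= 1;
        b >>= 1;
        c >>= 1;
        d >>= 1;
    }
}

I'm not sure whether you would gain any more performance by reordering the instructions like this:

    hist[i +   0] -= a & 1;
    a >>= 1;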

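You could try both ways and compare the assembly language generated for each.

One idea here is to maximize register usage. The values to test are loaded into registers, and then the testing begins.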
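Whichever variant is tried, it helps to confirm that it still matches the original `std::bitset` routine before comparing speed. Below is a minimal sketch of such a check; the names `HistFn`, `histSubtractReference`, `histSubtractInterleaved`, and `matchesReference` are hypothetical, and the interleaved variant uses subtraction to match the question's routine.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>

using HistFn = void (*)(const std::uint64_t*, std::uint16_t*);

// Reference: same logic as the std::bitset routine in the question.
void histSubtractReference(const std::uint64_t* cursor, std::uint16_t* hist) {
    for (int word = 0; word < 4; ++word) {
        std::bitset<64> bits(cursor[word]);
        for (int bit = 0; bit < 64; ++bit)
            hist[64 * word + bit] -= bits.test(bit);
    }
}

// Interleaved variant: four words held in locals and shifted in lockstep.
void histSubtractInterleaved(const std::uint64_t* cursor, std::uint16_t* hist) {
    std::uint64_t a = cursor[0], b = cursor[1], c = cursor[2], d = cursor[3];
    for (int i = 0; i < 64; ++i) {
        hist[i +   0] -= a & 1;
        hist[i +  64] -= b & 1;
        hist[i + 128] -= c & 1;
        hist[i + 192] -= d & 1;
        a >>= 1; b >>= 1; c >>= 1; d >>= 1;
    }
}

// Returns true if candidate and reference leave identical histograms.
bool matchesReference(HistFn candidate, const std::uint64_t* words) {
    std::uint16_t expected[256], actual[256];
    for (int i = 0; i < 256; ++i) expected[i] = actual[i] = 1000;
    histSubtractReference(words, expected);
    candidate(words, actual);
    return std::memcmp(expected, actual, sizeof expected) == 0;
}
```

Any other candidate with the same signature can be passed to `matchesReference` in the same way.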

How to speed up bit testing