Question

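I am using an algorithm that performs many popcount / sideways addition operations up to a given index on a 32-bit type. I would like to minimise the number of operations needed to do what I have currently implemented as follows: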

int popcto_test1(unsigned int bitmap[], int idx){
int i = 0,      // index
    count = 0;  // number of set bits
do {
    // Each node contains 8 bitmaps
    if(bitmap[i/32] & 1u << (i & 31)){
        ++count;
    }
    ++i;
} while (i < idx);

return count;
}

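I know there are bit-twiddling hacks for 64-bit types, but there does not seem to be a fast way of doing this for 32-bit types.

Is there a better way (fewer operations / minimal branching)? Or even just an alternative I could try, ideally with source?

I am aware from reading similar posts that this kind of optimisation is usually not recommended, but my project focuses on comparing the performance differences of "optimisations" and whether or not they actually improve performance.

I have since run a series of performance benchmarks on the suggested methods and on the code above (each tested 4,000,000 times) and got the following results: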

avg popcto_test1 ns = 133
avg popcto_test2 // failed test
avg popcto_test3 ns = 28
avg popcto_test4 ns = 74

The test functions are as follows:
Failed test 2:

int popcto_test2(unsigned int bitmap[], int idx){
int i = 0,      // index
    count = 0;  // number of set bits
do {
    // Each node contains 8 bitmaps
    count += (bitmap[i/32] & (1 << (i & 31)));
    ++i;
} while (i < idx);

return count;
}
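popcto_test2 gives wrong counts because `bitmap[i/32] & (1 << (i & 31))` evaluates to either 0 or the mask value itself (1, 2, 4, ...), not 0 or 1, so the loop sums bit weights rather than counting bits. A minimal branchless sketch that shifts the wanted bit down to position 0 first is shown below. The name `popcto_branchless` is hypothetical, and the sketch uses a `for` loop, so it returns 0 for `idx == 0`, unlike the `do`/`while` versions.

```c
#include <stdint.h>

/* Hypothetical name, not from the original post.
   Counts the set bits at positions [0, idx). */
int popcto_branchless(const uint32_t bitmap[], int idx) {
    int count = 0;
    for (int i = 0; i < idx; ++i) {
        /* Move bit (i & 31) of word i/32 to position 0, then keep only
           that bit, so each step adds exactly 0 or 1. */
        count += (int)((bitmap[i / 32] >> (i & 31)) & 1u);
    }
    return count;
}
```

This still visits every bit, so it would not be expected to beat the word-at-a-time versions; it only removes the branch.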

popcto_test3 ns = 28
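One (perhaps) interesting point about this one is that, although it is the fastest, it gives incorrect results when optimisation level 2 or 3 (-O2 / -O3) is used.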

int popcto_test3(unsigned int bitmap[], int idx){
int i = 0,      // index
    count = 0,  // number of set bits
    map = idx/32;
while (i < map){
    // Each node contains 8 bitmaps
    count += __builtin_popcount(bitmap[i]);
    ++i;
}

count += __builtin_popcount(bitmap[map] & ((1<<idx)-1));
return count;
}
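A likely cause of the wrong results at -O2 / -O3 is the expression `(1<<idx)-1`: shifting an `int` by 32 or more is undefined behaviour in C, and `idx` can be as large as 256 here. Unoptimised x86 builds may appear to work because the hardware shift instruction only uses the low 5 bits of the count, but an optimising compiler is free to assume the shift is in range. The sketch below is one way to avoid this, assuming GCC or Clang for `__builtin_popcount`. The name `popcto_masked` is hypothetical. It masks with `idx % 32` using unsigned arithmetic, and skips the final word when `idx` is a multiple of 32 so that it never reads `bitmap[idx/32]` past the end of the array.

```c
#include <stdint.h>

/* Hypothetical name, not from the original post.
   Counts the set bits at positions [0, idx); idx must be non-negative. */
int popcto_masked(const uint32_t bitmap[], int idx) {
    int words = idx / 32;   /* whole 32-bit words below idx */
    int rem   = idx % 32;   /* bits to take from the next word, 0..31 */
    int count = 0;

    for (int i = 0; i < words; ++i) {
        count += __builtin_popcount(bitmap[i]);
    }
    if (rem != 0) {
        /* rem is in 1..31 here, so the unsigned shift is well defined. */
        uint32_t mask = ((uint32_t)1 << rem) - 1u;
        count += __builtin_popcount(bitmap[words] & mask);
    }
    return count;
}
```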

avg popcto_test4 ns = 74 (modified Peter Wegner method)

int popcto_test4(unsigned int bitmap[], int idx){
int i = 0,      // index
    j = 0,
    count = 0,  // number of set bits
    map = idx/32;
unsigned int temp = 0;

while (i < map){
    temp = bitmap[i];
    j = 0;
    while(temp){
        temp &= temp - 1;
        ++j;
    }
    count += j;
    ++i;
}
temp = bitmap[i] & ((1<<idx)-1);
j = 0;
while(temp){
    temp &= temp - 1;
    ++j;
}
return count + j;
}
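The inner loop in popcto_test4 relies on `temp &= temp - 1` clearing the lowest set bit, so it iterates once per set bit rather than once per bit position. Pulled out as a standalone helper it might look like the sketch below; the name `wegner_popcount32` is hypothetical. Note that the final mask `(1<<idx)-1` in popcto_test4 has the same out-of-range shift as popcto_test3 whenever `idx` is 32 or more.

```c
#include <stdint.h>

/* Hypothetical helper name, not from the original post.
   Counts set bits by repeatedly clearing the lowest one. */
int wegner_popcount32(uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1u;   /* clears the lowest set bit of x */
        ++n;
    }
    return n;
}
```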

Answer 1

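Thanks to everyone for the suggestions. As I could not find any similar tests, I decided to run a head-to-head of all the methods I had come across.

N.B. The population counts shown are for the indices up to argv[1], not a popcount of argv[1] itself; 8 x 32-bit arrays make up the 256 bits. The code used to produce these results can be seen here.

On my Ryzen 1700, for my usage, the fastest population count was (often) the one on page 180 of the Software Optimization Guide for AMD64 Processors. This (often) remains true for larger population counts as well.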

unsigned int population_count(unsigned int temp){
    // Software Optimization Guide for AMD64 Processors - Page 180
    // Unsigned operand: the final multiply relies on wraparound.
    temp = temp - ((temp >> 1) & 0x55555555);
    temp = (temp & 0x33333333) + ((temp >> 2) & 0x33333333);
    return (((temp + (temp >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
}
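To show how this routine could be applied to the 8 x 32-bit bitmap from the question without compiler builtins, here is a hedged sketch. The names `swar_popcount32` and `popcto_swar` are hypothetical, the arithmetic mirrors the routine above on `uint32_t`, and the partial word is masked with `idx % 32` so every shift stays in range.

```c
#include <stdint.h>

/* Hypothetical name; same steps as the routine above, on uint32_t. */
uint32_t swar_popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* sums per 2-bit pair */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* sums per nibble */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* sums per byte */
    return (v * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Hypothetical name. Counts the set bits at positions [0, idx);
   idx must be non-negative. */
int popcto_swar(const uint32_t bitmap[], int idx) {
    int words = idx / 32;   /* whole words below idx */
    int rem   = idx % 32;   /* remaining bits in the next word */
    int count = 0;

    for (int i = 0; i < words; ++i) {
        count += (int)swar_popcount32(bitmap[i]);
    }
    if (rem != 0) {
        uint32_t mask = ((uint32_t)1 << rem) - 1u;
        count += (int)swar_popcount32(bitmap[words] & mask);
    }
    return count;
}
```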

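I do not have a side-by-side comparison for this, but if you happen to be using CUDA, the intrinsic __popc method is the fastest, followed by the Wegner method. The AMD64 method is the second slowest (only the purely bitwise method is slower), which I believe is due to increased occupancy/register usage compared with all the other methods.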

Count bits set up to a given position for a 32-bit type
