Question

I'm trying to count the number of 1 bits in the numbers of an array.

First, I have this code in C (it works fine):

int popcount2(int* array, int len){
    int i;
    unsigned x;
    int result=0;
    for (i=0; i<len; i++){
        x = array[i];
        do{
           result+= x & 0x1;
           x>>= 1;
       } while(x);
    }
return result;
}

Now I need to translate the do-while loop into assembly using 3-6 lines of code. I wrote some code, but the result is incorrect. (I'm new to the assembly world.)

int popcount3(int* array, int len){
int  i;
unsigned x;
int result=0;   
for (i=0; i<len; i++){
    x = array[i];
    asm(
    "ini3:               \n"
        "adc $0,%[r]     \n"
        "shr %[x]        \n"
        "jnz ini3        \n"

        : [r]"+r" (result)
        : [x] "r" (x)       );
  }
}

I'm using GCC (on Linux) on an Intel processor.

Answer 1

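You're starting out with a very inefficient algorithm; with a better algorithm you may not need to spend time on assembler at all. See Hacker's Delight and/or Bit Twiddling Hacks for much more efficient methods.

Note also that newer x86 CPUs have a POPCNT instruction, which does all of the above in a single instruction (and you can call it via an intrinsic, so no asm is needed).

Finally, gcc has a builtin, __builtin_popcount, which does everything you need: it uses POPCNT when compiling for a CPU that has it (for example with -mpopcnt) and equivalent code for older CPUs.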
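As a minimal sketch of that last suggestion, the question's array loop could be written with the builtin roughly as follows. The function name popcount_builtin is made up for illustration; __builtin_popcount is the GCC/Clang builtin and takes an unsigned int.

```c
/* Sum the set bits of every array element using the GCC/Clang builtin.
   popcount_builtin is a hypothetical name, not taken from the question. */
int popcount_builtin(const int *array, int len)
{
    int result = 0;
    for (int i = 0; i < len; i++) {
        /* The cast mirrors `unsigned x = array[i];` in the question. */
        result += __builtin_popcount((unsigned)array[i]);
    }
    return result;
}
```

Whether this compiles to a POPCNT instruction depends on the target options; without -mpopcnt or a suitable -march, GCC falls back to a generic implementation.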

Answer 2

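When I needed to create a popcount, I ended up using the 5's and 3's method from the Bit Twiddling Hacks that @PaulR mentioned. But if I wanted to do this with a loop, it might look something like this: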

#include <stdio.h>
#include <stdlib.h>

int popcount2(int v) {
   int result = 0;
   int junk;

   asm (
        "shr $1, %[v]      \n\t"   // shift low bit into CF
        "jz 2f             \n"     // and skip the loop if that was the only set bit
     "1:                   \n\t"   // numeric local labels stay unique if this asm is emitted more than once
        "adc $0, %[result] \n\t"   // add CF (0 or 1) to result
        "shr $1, %[v]      \n\t"
        "jnz 1b            \n"     // leave the loop after shifting out the last bit
     "2:                   \n\t"
        "adc $0, %[result] \n\t"   // and add that last bit

        : [result] "+r" (result), "=r" (junk)
        : [v] "1" (v)
        : "cc"
   );

   return result;
}

int main(int argc, char *argv[])
{
   for (int x=0; x < argc-1; x++)
   {
      int v = atoi(argv[x+1]);

      printf("%d %d\n", v, popcount2(v));
   }
}

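adc is almost always more efficient than branching on CF.

"=r" (junk) is a dummy output operand that lives in the same register as v (because of the "1" matching constraint). We use it to tell the compiler that the asm statement destroys the v input. We could use [v] "+r"(v) to get a read-write operand, but we don't want the C variable v to be updated.

Note that the loop trip count for this implementation is the position of the highest set bit (bsr, or 32 - clz(v)). @rcgldr's implementation, which clears the lowest set bit on each iteration, will typically be faster when the number of set bits is low but they are not all near the bottom of the integer.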
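To make the trip-count difference concrete, here is a plain C sketch of the two loop shapes that also reports how many iterations each one ran. The function names and the trips out-parameter are assumptions added for illustration; neither answer contains this code.

```c
/* Shift loop: one iteration per bit position up to the highest set bit. */
int popcount_shift(unsigned v, int *trips)
{
    int result = 0;
    *trips = 0;
    while (v) {
        result += v & 1u;   /* add the low bit */
        v >>= 1;
        ++*trips;
    }
    return result;
}

/* Clear-lowest-set-bit loop: one iteration per set bit. */
int popcount_clear_lowest(unsigned v, int *trips)
{
    int result = 0;
    *trips = 0;
    while (v) {
        v &= v - 1;         /* drops the lowest set bit */
        ++result;
        ++*trips;
    }
    return result;
}
```

For v = 0x80000001 the shift loop runs 32 times and the clear-lowest loop runs twice; both return 2.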

Answer 3

Using 3-6 lines of code in assembly.

This example uses a 4-instruction loop:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        xor     eax,eax                 ;will be popcnt
        test    ecx,ecx                 ;br if ecx == 0
        jz      popc1
popc0:  lea     edx,[ecx-1]             ;edx = ecx-1
        inc     eax                     ;eax += 1
        and     ecx,edx                 ;ecx &= (ecx-1)
        jnz     short popc0
popc1:  ret
popcntx endp

This example uses a 3-instruction loop, but it is slower than the 4-instruction loop version on most processors.

popcntx proc    near
        mov     eax,[esp+4]             ;eax = value to popcnt
        mov     ecx,32                  ;ecx = max # 1 bits
        test    eax,eax                 ;br if eax == 0
        jz      popc1
popc0:  lea     edx,[eax-1]             ;eax &= (eax-1)
        and     eax,edx
        loopnz  popc0
popc1:  neg     ecx
        lea     eax,[ecx+32]
        ret
popcntx endp

Here is an alternative, non-looping example:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        mov     edx,ecx                 ;edx = ecx
        shr     edx,1                   ;mov upr 2 bit field bits to lwr
        and     edx,055555555h          ; and mask them
        sub     ecx,edx                 ;ecx = 2 bit field counts
                                        ; 0->0, 1->1, 2->1, 3->2
        mov     eax,ecx
        shr     ecx,02h                 ;mov upr 2 bit field counts to lwr
        and     eax,033333333h          ;eax = lwr 2 bit field counts
        and     ecx,033333333h          ;ecx = upr 2 bit field counts
        add     ecx,eax                 ;ecx = 4 bit field counts
        mov     eax,ecx
        shr     eax,04h                 ;mov upr 4 bit field counts to lwr
        add     eax,ecx                 ;eax = 8 bit field counts
        and     eax,00f0f0f0fh          ; after the and
        imul    eax,eax,01010101h       ;eax bit 24->28 = bit count
        shr     eax,018h                ;eax bit 0->4 = bit count
        ret
popcntx endp
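For readers more comfortable with C, the non-looping routine above corresponds roughly to the following sketch. The function name popcount_parallel is hypothetical; the masks and the multiply are the same ones used in the assembly.

```c
#include <stdint.h>

/* Branch-free popcount: sum bits in 2-, 4- and 8-bit fields, then add the bytes. */
uint32_t popcount_parallel(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit field counts */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit field counts */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field counts */
    return (v * 0x01010101u) >> 24;                   /* sum of the four bytes */
}
```

The multiply by 0x01010101 accumulates all four byte counts into the top byte, which is why the final shift is by 24.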

Answer 4

The best idea is to use the built-in popcount function, as suggested by Paul R, but since you need to write it in assembly, this worked for me:

asm (
"start:                  \n"
        "and %1, %1      \n"
        "jz end          \n"
        "shr $1, %1      \n"
        "jnc start       \n"
        "inc %0          \n"
        "jmp start       \n"
"end:                    \n"
        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);

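In the first two lines you just check the contents of x (and go to end if it is zero, Jump Zero). Then you shift x one bit to the right, and:

At the end of the shift operation, the CF flag contains the last bit shifted out of the destination operand. *

If CF is not set, just go to start (Jump Not Carry); otherwise increment result and then go to start.

The beautiful thing about assembly is that you can do things in so many ways...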

asm (
"start:                  \n"
        "shr $1, %1      \n"
        "jnc loop_cond   \n"
        "inc %0          \n"
        "and %1, %1      \n"
"loop_cond:              \n"
        "jnz start       \n"

        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);

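Here I use the SHift Right instruction again; if CF is not set, just go to the loop condition.

Otherwise increment result again and perform a binary AND, because INC modifies ZF.

Using `LOOP` and `ECX`

I was curious how to do this in 3 instructions (I figured your teacher wouldn't give you a lower limit of 3 if it weren't possible), and I realized that x86 also has the LOOP instruction:

Each time the LOOP instruction is executed, the count register is decremented, then checked for 0. If the count is 0, the loop is terminated and program execution continues with the instruction following the LOOP instruction. If the count is not zero, a near jump is performed to the destination (target) operand, which is presumably the instruction at the beginning of the loop. *

You can load the count register using GCC's machine constraint:

c - the c register.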

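int count = 8;            // Assuming 8b type (char)
asm (
"start:              \n"
    "shr $1, %1      \n"
    "adc $0, %0      \n"
    "loop start      \n"

    : "+g" (result),
      "+r" (x),           // x is shifted in place, so it must be read-write
      "+c" (count)        // LOOP decrements the c register
    :
    : "cc"
);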

Just to make sure that it compiles to the proper assembly:

0x000000000040051f <+25>:   mov    $0x8,%ecx
0x0000000000400524 <+30>:   mov    -0x8(%rbp),%eax
0x0000000000400527 <+33>:   shr    %edx
0x0000000000400529 <+35>:   adc    $0x0,%eax
0x000000000040052c <+38>:   loop   0x400527 <main+33>

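I think the first one should have better performance, especially when only 1 bit is set; this approach always performs k*8 iterations (k presumably being the size of the type in bytes).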
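The snippet above is only the asm statement. A complete function for a 32-bit value might look like the sketch below; the name popcount_loop32, the fixed count of 32 and the portable fallback are assumptions added here, not part of the answer. The counter is an unsigned long so that it fills the whole count register on x86-64, where LOOP uses RCX.

```c
/* Hypothetical wrapper: counts set bits with a fixed 32-iteration LOOP. */
int popcount_loop32(unsigned x)
{
    int result = 0;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned long count = 32;       /* one iteration per bit of a 32-bit value */
    __asm__ (
        "1:              \n\t"
        "shr $1, %[x]    \n\t"      /* low bit -> CF */
        "adc $0, %[r]    \n\t"      /* result += CF */
        "loop 1b         \n\t"      /* decrement the count register, repeat while nonzero */
        : [r] "+r" (result), [x] "+r" (x), "+c" (count)
        :
        : "cc"
    );
#else
    /* Portable fallback with the same fixed trip count. */
    for (int i = 0; i < 32; i++) {
        result += x & 1u;
        x >>= 1;
    }
#endif
    return result;
}
```

All three modified values (result, x and the counter) are declared as read-write operands so the compiler knows the asm changes them.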

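SSE4 and a single instruction

I know you have to use a loop, but just for fun... with the SSE4 extension you can do this with a single instruction, POPCNT:

This instruction calculates the number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register). *

I think (I have a pretty old CPU in my notebook, so I can't test this for you) you should be able to do this with just one simple instruction: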

asm (
    "POPCNT %1, %0   \n"
    : "=r" (result)
    : "mr" (x)
    : "cc"
);

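(If you try this and you have the SSE4 extension, please let me know whether it works.)

Performance

I measured the time required to do 100,000,000 popcounts, comparing my first and second approaches with David Wohlferd's. [Raw data]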

+--------------+------------+------------+------------+
|              | 0x00000000 | 0x80000001 | 0xffffffff |
+--------------+------------+------------+------------+
| 1st solution |  0.543     |  5.040     |  3.833     |
| LOOP         | 11.530     | 11.523     | 11.523     |
| David's      |  0.750     |  4.893     |  4.890     |
+--------------+------------+------------+------------+

I'd be glad if someone could compare these 3 with SSE4's POPCNT instruction.
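The answer does not show its measurement code, so the following is only a rough harness in the same spirit. The names time_popcounts and popcount_ref, and the use of clock(), are assumptions. Compiling with -mpopcnt should let the builtin-based candidate use POPCNT, which is one way to get the missing comparison.

```c
#include <time.h>

/* Hypothetical harness: time `n` calls of a popcount function on one value. */
double time_popcounts(int (*fn)(unsigned), unsigned value, long n)
{
    volatile unsigned sink = 0;     /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (long i = 0; i < n; i++)
        sink += (unsigned)fn(value);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;   /* CPU seconds */
}

/* Reference candidate built on the GCC/Clang builtin. */
int popcount_ref(unsigned v)
{
    return __builtin_popcount(v);
}
```

Calling time_popcounts(popcount_ref, 0x80000001u, 100000000L) would time the builtin for one column of the table; the other candidates can be passed in the same way once they are wrapped as int f(unsigned).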
