Question

I have the following x86 assembly code:

  movl   8(%ebp), %edx  //get an argument from the caller
  movl   $0, %eax
  testl  %edx, %edx
  je     .L1            
.L2:                   // what's the purpose of this loop body?
  xorl   %edx, %eax
  shrl   $1, %edx
  jne    .L2
.L1:
  andl   $1, %eax

The textbook gives the corresponding C code as follows:

int f1(unsigned x)
{
    int y = 0;
    while(x != 0) {
        __________;
    }
    return __________;
 }

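The book asks readers to fill in the blanks and answer the question "What does it do?"

I can't combine the loop body into a single C expression. I can tell what the loop body does, but I have no idea what its purpose is. The textbook also says that %eax stores the return value. So what is the purpose of

andl  $1, %eax

I have no idea about that either.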

Answer 1

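It looks like the purpose of the whole loop is to XOR together all the bits of the 32-bit arg, i.e. to calculate the parity.

Working backwards from the last instruction (and $1,%eax), we know that only the low bit of the result matters.

With that in mind, the xor %edx,%eax becomes clearer: it XORs the current low bit of %edx into %eax. The garbage in the high bits doesn't matter.

The shr loops until all of x's bits have been shifted out. We could always loop 32 times to get all the bits, but that would be less efficient than stopping once x is 0. (Because of how XOR works, we don't need to actually XOR in the 0 bits; doing so has no effect.)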
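To make the early-exit argument concrete, here is a minimal sketch that compares the loop from the question against a version that always visits all 32 bits. The function names and the sample values are made up for illustration, and the sketch assumes a 32-bit unsigned.

```c
#include <assert.h>

/* The loop from the question: stops as soon as x becomes 0. */
static int parity_early_exit(unsigned x)
{
    int y = 0;
    while (x != 0) {
        y ^= x;      /* only bit 0 of y matters in the end */
        x >>= 1;
    }
    return y & 1;
}

/* Hypothetical reference version: always looks at all 32 bits, one at a time. */
static int parity_all_bits(unsigned x)
{
    int y = 0;
    for (int i = 0; i < 32; i++)
        y ^= (int)((x >> i) & 1u);   /* XORing in a 0 bit leaves y unchanged */
    return y;
}

/* Spot-check that both versions agree on a few sample inputs. */
static void check_parity_versions(void)
{
    const unsigned samples[] = { 0u, 1u, 3u, 7u, 0x80000000u, 0xFFFFFFFFu, 0x12345678u };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(parity_early_exit(samples[i]) == parity_all_bits(samples[i]));
}
```

Calling check_parity_versions() from main should pass silently if the two versions agree, which is what the reasoning above predicts.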

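Once we know what the function does, filling in the C becomes an exercise in clever/compact C syntax. I thought at first that y ^= (x>>=1); would fit inside the loop, but that shifts x before using it the first time.

The only way I see to do it in one C statement is with the , operator (which does introduce a sequence point, so it's safe to read x on the left side and modify it on the right side of a ,). So, y ^= x, x>>=1; fits.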
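As a rough illustration of the difference, the sketch below puts the comma-operator version next to the rejected shift-first attempt. The names f1_comma and f1_shift_first are invented here and are not from the textbook.

```c
#include <assert.h>

/* Textbook template with both blanks filled in (a single expression in the loop body). */
static int f1_comma(unsigned x)
{
    int y = 0;
    while (x != 0) {
        y ^= x, x >>= 1;   /* comma operator: y ^= x is fully evaluated before x >>= 1 */
    }
    return y & 1;
}

/* The rejected attempt: shifts x before its first use, so bit 0 of the original x is lost. */
static int f1_shift_first(unsigned x)
{
    int y = 0;
    while (x != 0) {
        y ^= (x >>= 1);
    }
    return y & 1;
}

/* For x = 1 the two versions disagree: only the comma version sees the set bit. */
static void check_f1_variants(void)
{
    assert(f1_comma(1u) == 1);
    assert(f1_shift_first(1u) == 0);
}
```

With x = 1, the shift-first version shifts the only set bit out before it is ever XORed into y, which is why it does not match the asm.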

Or, for more readable code, just cheat and put two statements on the same line, separated by a ;.

int f1(unsigned x)
{
    int y = 0;
    while(x != 0) {
        y ^= x;  x>>=1;
    }
    return y & 1;
}

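This compiles to essentially the same asm as shown in the question, using gcc5.3 -O3 on the Godbolt compiler explorer. The question's code de-optimizes the xor-zeroing idiom to a mov $0, %eax, and optimizes away gcc's silly duplication of ret instructions. (Or maybe it was produced by an earlier version of gcc that didn't do that.)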

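The loop is very inefficient: this is an efficient way:

We don't need a loop with O(n) complexity (where n is the width in bits of x). Instead, we can get O(log2(n)) complexity, and actually take advantage of x86 tricks so that we only need to do the first two steps of that.

I've left off the operand-size suffix for instructions where it's determined by the registers. (Except for xorw, to make the 16-bit xor explicit.)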

#untested
parity:
    # no frame-pointer boilerplate

    xor       %eax,%eax        # zero eax (so the upper 24 bits of the int return value are zeroed).  And yes, this is more efficient than mov $0, %eax
                               # so when we set %al later, the whole of %eax will be good.

    movzwl    4(%esp), %edx    # load low 16 bits of `x`.  (zero-extend into the full %edx is for efficiency.  movw 4(%esp), %dx would work too.)
    xorw      6(%esp), %dx     # xor the high 16 bits of `x`
    # Two loads instead of a load + copy + shift is probably a win, because cache is fast.
    xor       %dh, %dl         # xor the two 8 bit halves, setting PF according to the result
    setnp     %al              # get the inverse of the CPU's parity flag.  Remember that the rest of %eax is already zero, so the result is already zero-extended to 32-bits (int return value)
    ret

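Yes, that's right: x86 has a parity flag (PF) that is updated from the low 8 bits of the result of every instruction that "sets flags according to the result", like xor.

We use the np condition because PF = 1 means even parity: xor of all bits = 0. We need the inverse so that we return 0 for even parity.

To take advantage of it, we do a SIMD-style horizontal reduction by bringing the high half down to the low half and combining, repeating twice to reduce 32 bits to 8 bits.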
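The same reduction can be sketched in C. This is only an illustration of the idea, not what a compiler emits: emulated_pf is a made-up helper that models the PF rule described above in software, and the sketch assumes a 32-bit unsigned.

```c
#include <assert.h>

/* Software model of PF: 1 when the low 8 bits of a result hold an even number of 1 bits. */
static int emulated_pf(unsigned result)
{
    unsigned b = result & 0xFFu;
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;             /* bit 0 of b is now the XOR of all 8 bits */
    return !(b & 1u);
}

/* Same shape as the asm: fold 32 -> 16 -> 8 bits by XORing halves, then invert PF. */
static int parity_fold(unsigned x)
{
    x ^= x >> 16;            /* like xorw 6(%esp), %dx */
    x ^= x >> 8;             /* like xor %dh, %dl: the low byte now carries the parity of all 32 bits */
    return !emulated_pf(x);  /* like setnp: 1 when PF == 0, i.e. an odd number of set bits */
}

/* A few spot checks on inputs whose parity is easy to see by hand. */
static void check_parity_fold(void)
{
    assert(parity_fold(0u) == 0);
    assert(parity_fold(1u) == 1);
    assert(parity_fold(0xFFFFFFFFu) == 0);
}
```

On real hardware the last three XOR steps inside emulated_pf come for free, because the flag is computed by the xor instruction itself.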

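Zeroing eax (with an xor) before the instruction that sets flags is slightly more efficient than doing set-flags / setp %al / movzbl %al, %eax, as I explained in What is the best way to set a register to zero in x86 assembly: xor, mov or and?.

Or, as @EOF points out, if the CPUID POPCNT feature bit is set, you can use popcnt and test the low bit to see whether the number of set bits is even or odd. (Another way to look at this: xor is add-without-carry, so the low bit is the same whether you xor all the bits together or add all the bits together horizontally.)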
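The add-without-carry remark can be checked with a small portable sketch. The helper names add_bits and xor_bits are hypothetical, and the bit count is done with a plain loop here rather than the popcnt instruction.

```c
#include <assert.h>

/* Add all the bits of x together (a software population count). */
static unsigned add_bits(unsigned x)
{
    unsigned sum = 0;
    for (; x != 0; x >>= 1)
        sum += x & 1u;
    return sum;
}

/* XOR all the bits of x together (the parity). */
static unsigned xor_bits(unsigned x)
{
    unsigned acc = 0;
    for (; x != 0; x >>= 1)
        acc ^= x & 1u;
    return acc;
}

/* The low bit of the sum should equal the XOR for every input tried here. */
static void check_low_bit_matches(void)
{
    const unsigned samples[] = { 0u, 1u, 7u, 0xFFu, 0x80000001u, 0xDEADBEEFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert((add_bits(samples[i]) & 1u) == xor_bits(samples[i]));
}
```

Carries out of bit 0 only affect the higher bits of the sum, so they never change the low bit that the parity test looks at.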

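GNU C also has __builtin_parity and __builtin_popcount, which use the hardware instruction if you tell the compiler that the compile target supports it (with -march=... or -mpopcnt), but otherwise compile to an efficient sequence for the target machine. The Intel intrinsics always compile to the machine instruction, not a fallback sequence, and it's a compile-time error to use them without the appropriate -mpopcnt target option.

Unfortunately, gcc doesn't recognize the pure-C loop as a parity calculation and optimize it into this. Some compilers (like clang and gcc) can recognize some kinds of popcount idioms and optimize them into the popcnt instruction, but that kind of pattern recognition doesn't happen in this case. :(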

See these on godbolt

int parity_gnuc(unsigned x) {
    return __builtin_parity(x);
}
// with -mpopcnt, compiles the same as below
// without popcnt, compiles to the same upper/lower half XOR algorithm I used, and a setnp
//  using one load and mov/shift for the 32->16 step, and still %dh, %dl for the 16->8 step.

#ifdef __POPCNT__
#include <immintrin.h>
int parity_popcnt(unsigned x) {
    return _mm_popcnt_u32(x) & 1;
}
#endif
// gcc does compile this to the optimal code:
//     popcnt  4(%esp), %eax
//     and     $1, %eax
//     ret

See also the other links in the x86 tag wiki.
