Question

Question: An apparently extra line of code speeds up the program almost twofold.

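The original problem is quite hard to formulate; it came from a bounds-check elimination algorithm. So here is just a simple test that I can't understand.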

An apparently redundant line of code makes the program run almost twice as fast.

Here is the source:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
   long i = 0, a = 0, x = 0;
   int  up = 200000000;

   int *values = malloc(sizeof(int)*up);

   for (i = 0; i < up ; ++i)
   {
        values[i]=i % 2;
   }
   for (i = 0; i < up  ; ++i)
   {
      x  =  (a & i);
#ifdef FAST
      x = 0;
#endif
      a += values[x];
   }
   printf ("a=%ld\n", a);
   return 0;
}/*main*/

In this example, the value of 'a' is always 0. The line x = 0; is redundant.

But (look - NO optimizations at all!):
$ gcc -O0 -o short short.c && time ./short
a=0
real    0m2.808s
user    0m2.196s
sys     0m0.596s

$ gcc -O0 -DFAST -o short short.c && time ./short
a=0
real    0m1.869s
user    0m1.260s
sys     0m0.608s

And this is reproducible across many compilers, optimization options, and variations of the program.

Moreover, the two builds really do yield the same assembly code, apart from the one instruction that stores this silly extra 0! E.g.:

gcc -S -O0 -DFAST short.c && mv short.s shortFAST.s
gcc -S -O0 short.c && mv short.s shortSLOW.s
diff shortFAST.s shortSLOW.s
55d54
< movq $0, -24(%rbp)

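Later I saw the same effect with the other compilers/languages I was able to test (including the Java JIT). The only thing they share is the x86-64 architecture. It was tested on both Intel and AMD processors...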

Answer 1

Short answer: Storing a 0 eliminates a read-after-write dependency in one of the loops.

Details:

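I thought this was an interesting question, and although you focused on the O0 optimization level, the same speedup can be seen at O3 as well. But looking at O0 makes it easier to focus on what the processor, rather than the compiler, is doing to optimize the code, because, as you noted, the resulting assembly code differs by only 1 instruction.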

The assembly code for the loop of interest is shown below:

  movq  $0, -32(%rbp)
  jmp .L4
.L5:     
  movq  -32(%rbp), %rax
  movq  -24(%rbp), %rdx
  andq  %rdx, %rax     
  movq  %rax, -16(%rbp)
  movq  $0, -16(%rbp)     ;; This instruction in FAST but not SLOW
  movq  -16(%rbp), %rax
  leaq  0(,%rax,4), %rdx
  movq  -8(%rbp), %rax  
  addq  %rdx, %rax      
  movl  (%rax), %eax    
  cltq                  
  addq  %rax, -24(%rbp) 
  addq  $1, -32(%rbp) 
.L4:                    
  movl  -36(%rbp), %eax 
  cltq                  
  cmpq  -32(%rbp), %rax 
  jg  .L5

Running with perf stat on my system, I get the following results:

Results for the slow code

Performance counter stats for './slow_o0':

   1827.438670 task-clock                #    0.999 CPUs utilized          
           155 context-switches          #    0.085 K/sec                  
             1 CPU-migrations            #    0.001 K/sec                  
       195,448 page-faults               #    0.107 M/sec                  
 6,675,246,466 cycles                    #    3.653 GHz                    
 4,391,690,661 stalled-cycles-frontend   #   65.79% frontend cycles idle   
 1,609,321,845 stalled-cycles-backend    #   24.11% backend  cycles idle   
 7,157,837,211 instructions              #    1.07  insns per cycle        
                                         #    0.61  stalled cycles per insn
   490,110,757 branches                  #  268.195 M/sec                  
       178,287 branch-misses             #    0.04% of all branches        

   1.829712061 seconds time elapsed

Results for the fast code

 Performance counter stats for './fast_o0':

   1109.451910 task-clock                #    0.998 CPUs utilized          
            95 context-switches          #    0.086 K/sec                  
             1 CPU-migrations            #    0.001 K/sec                  
       195,448 page-faults               #    0.176 M/sec                  
 4,067,613,078 cycles                    #    3.666 GHz                    
 1,784,131,209 stalled-cycles-frontend   #   43.86% frontend cycles idle   
   438,447,105 stalled-cycles-backend    #   10.78% backend  cycles idle   
 7,356,892,998 instructions              #    1.81  insns per cycle        
                                         #    0.24  stalled cycles per insn
   489,945,197 branches                  #  441.610 M/sec                  
       176,136 branch-misses             #    0.04% of all branches        

   1.111398442 seconds time elapsed

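So you can see that even though the "fast" code executes more instructions, it has fewer stalls. When an out-of-order CPU (like most x86-64 implementations) executes code, it keeps track of the dependencies between instructions. An instruction that is waiting on its operands can be bypassed by a later instruction whose operands are ready.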

In this example, the critical point is likely this instruction sequence:

  andq  %rdx, %rax
  movq  %rax, -16(%rbp)
  movq  $0, -16(%rbp)     ;; This instruction in FAST but not SLOW
  movq  -16(%rbp), %rax  
  leaq  0(,%rax,4), %rdx
  movq  -8(%rbp), %rax

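In the fast code, the movq -16(%rbp), %rax load gets its value forwarded from the immediately preceding movq $0, -16(%rbp) store, so it can execute sooner. The slow version instead has to wait for movq %rax, -16(%rbp), whose value depends on the andq and therefore on a, which carries a longer dependency chain from one loop iteration to the next.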

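Note that this analysis is probably too simplistic without knowing more about the specific microarchitecture. But I suspect the root cause is this dependency, and that storing the 0 (the movq $0, -16(%rbp) instruction) allows the CPU to speculate more aggressively when executing the code sequence.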
