Question

I was messing around with the worst code I could write (basically trying to break things), and I noticed that this piece of code:

for(int i = 0; i < N; ++i)
    tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
std::cout << x;

where N is a global variable, runs significantly slower than:

int N = 10000;
for(int i = 0; i < N; ++i)
    tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
std::cout << x;

What happens with a global variable that makes it run slower?

Answer 1

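tl;dr: The local version keeps N in a register, the global version doesn't. Declare constants with const and it will be faster no matter how you declare it.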

Here is the sample code I used:

#include <iostream>
#include <math.h>
void first(){
  int x=1;
  int N = 10000;
  for(int i = 0; i < N; ++i)
    tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
  std::cout << x;
}
int N=10000;
void second(){
  int x=1;
  for(int i = 0; i < N; ++i)
    tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
  std::cout << x;
}
int main(){
  first();
  second();
}

(named test.cpp).

To see the generated assembly code, I ran g++ -S test.cpp.

I got a huge file, but with some smart searching (I searched for tan) I found what I wanted:

From the first function:

Ltmp2:
    movl    $1, -4(%rbp)
    movl    $10000, -8(%rbp) ; N is here !!!
    movl    $0, -12(%rbp)    ;initial value of i is here
    jmp LBB1_2       ;goto the 'for' code logic
LBB1_1:             ;the loop is this segment
    movl    -4(%rbp), %eax
    cvtsi2sd    %eax, %xmm0
    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
    callq   _tan
    callq   _tan
    callq   _tan
    callq   _tan
    callq   _tan        
    callq   _tan
    callq   _tan
    movl    -12(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -12(%rbp) 
LBB1_2:
    movl    -12(%rbp), %eax ;load i into a register
    movl    -8(%rbp), %ecx  ;load N from the function's own stack frame
    cmpl    %ecx, %eax  ;comparing N and i here
    jl  LBB1_1      ;if less, then go into loop code
    movl    -4(%rbp), %eax

From the second function:

Ltmp13:
    movl    $1, -4(%rbp)    ;x
    movl    $0, -8(%rbp)    ;i
    jmp LBB5_2
LBB5_1:             ;loop is here
    movl    -4(%rbp), %eax
    cvtsi2sd    %eax, %xmm0
    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
    callq   _tan
    callq   _tan
    callq   _tan
    callq   _tan
    callq   _tan
    callq   _tan
    callq   _tan
    movl    -8(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -8(%rbp)
LBB5_2:
    movl    _N(%rip), %eax  ;loading N from globals at every iteration, instead of keeping it in a register
    movl    -8(%rbp), %ecx

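So from the assembly you can see (or not) that in the local version N stays in the function's own frame for the whole computation (the stack slot at -8(%rbp) in this listing), whereas in the global version N is re-read from global memory on every iteration.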

I think the main reason this happens is things like threading: the compiler cannot be sure that N has not been modified.
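
One way to sidestep the reload, sketched here under assumptions: read the global once into a local before the loop, so the bound cannot change behind the compiler's back. The function name sum_with_local_bound and the accumulator are hypothetical, added only so that the work has an observable result.

```cpp
#include <cmath>

int N = 10000;  // global loop bound, as in the question

// Hypothetical helper: reads the global once, then loops on a local copy.
double sum_with_local_bound(int x) {
    const int n = N;   // single read of the global
    double acc = 0.0;  // accumulate so the tan calls are not dead code
    for (int i = 0; i < n; ++i)
        acc += std::tan(static_cast<double>(x++));
    return acc;
}
```

Note that this changes behaviour if something really does modify N during the loop: the local copy will not see the update.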

If you add const to the declaration of N (const int N=10000), it gets even faster than the local version:

    movl    -8(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -8(%rbp)
LBB5_2:
    movl    -8(%rbp), %eax
    cmpl    $9999, %eax ;i <= 9999 (with jle) is the same test as i < 10000
    jle LBB5_1

N is replaced by a constant.
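
As a small illustration of the const point, here is a sketch with a compile-time constant bound. The names kN and sum_with_const_bound are hypothetical; the accumulator is only there so that the calls have a visible effect.

```cpp
#include <cmath>

constexpr int kN = 10000;  // hypothetical name; compile-time constant bound

// The value is available to the compiler, not just at run time.
static_assert(kN == 10000, "the bound is known at compile time");

// The loop test compares i against a constant, so no global has to be loaded.
double sum_with_const_bound(int x) {
    double acc = 0.0;
    for (int i = 0; i < kN; ++i)
        acc += std::tan(static_cast<double>(x++));
    return acc;
}
```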

Answer 2

The global version can't be optimized to keep N in a register.

Answer 3

I did some experiments on the question and on @rtpg's answer.

Experimenting with the question

In the file main1.h, the global variable N:

int N = 10000;

Then in main1.c, the computation repeated in a loop (k from 0 to 1000):

#include <stdio.h>
#include "sys/time.h"
#include "math.h"
#include "main1.h"



extern int N;

int main(){

        int k = 0;
        struct timeval static_start, static_stop;
        int x = 0;

        int y = 0;
        struct timeval start, stop;
        int M = 10000;

        while(k <= 1000){

                gettimeofday(&static_start, NULL);
                for (int i=0; i<N; ++i){
                        tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
                }
                gettimeofday(&static_stop, NULL);

                gettimeofday(&start, NULL);
                for (int j=0; j<M; ++j){
                        tan(tan(tan(tan(tan(tan(tan(tan(y++))))))));
                }
                gettimeofday(&stop, NULL);

                int first_interval = static_stop.tv_usec - static_start.tv_usec;
                int last_interval = stop.tv_usec - start.tv_usec;

                if(first_interval >=0 && last_interval >= 0){
                        printf("%d, %d\n", first_interval, last_interval);
                }

                k++;
        }

        return 0;
}

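The results are shown in the following histogram (frequency/microseconds):

[Figure: histogram comparing the measured times of both methods]
The red boxes are the loop bounded by N (the global variable), and the transparent green ones are the loop bounded by M (non-global).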

There is evidence that the extern global variable is a little slower.
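
The program above subtracts only the tv_usec fields, so any sample that crosses a second boundary comes out negative and is dropped. A minimal sketch of the same kind of measurement with std::chrono::steady_clock, which is monotonic, might look like this. The helper time_us and the two loop functions are assumptions for illustration, not part of the original experiment.

```cpp
#include <chrono>
#include <cmath>

int N = 10000;  // global loop bound

// Hypothetical helper: runs `work` once and returns the elapsed microseconds.
template <typename F>
long long time_us(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Loop bounded by the global N; the volatile sink keeps the calls alive.
void global_bound_loop() {
    volatile double sink = 0.0;
    int x = 0;
    for (int i = 0; i < N; ++i) sink = std::tan(static_cast<double>(x++));
}

// Loop bounded by a local M.
void local_bound_loop() {
    volatile double sink = 0.0;
    int x = 0;
    const int M = 10000;
    for (int j = 0; j < M; ++j) sink = std::tan(static_cast<double>(x++));
}
```

Calling time_us(global_bound_loop) and time_us(local_bound_loop) inside an outer loop and printing both values gives the same kind of paired samples as the original program.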

Experimenting with the answer

@rtpg's reasoning is strong. In this sense, a global variable could be slower.

Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

To test this premise, I used a register global variable to measure the performance. This is my main1.h with the global variable:

int N asm ("myN") = 10000;

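Histogram of the new results:

[Figure: results with register global variable]

Conclusion

When the global variable is in a register, performance improves. There is no "global" versus "local" variable problem. Performance depends on how the variable is accessed.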

Answer 4

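I assume that the optimizer does not know the contents of the tan function when compiling the code above.

That is, what tan does is unknown: all the compiler knows is to stuff things onto the stack, jump to some address, and then clean up the stack afterwards.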

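In the global variable case, the compiler doesn't know what tan does to N. In the local case, there are no "loose" pointers or references to N that tan could legitimately get at, so the compiler knows what values N will take.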
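
To make that concrete, here is a small sketch in which the called function really does write to the global. The function opaque_call is a hypothetical stand-in for a function the optimizer cannot see into, and the small value of N is chosen only for the demo.

```cpp
int N = 10;  // global loop bound (small, hypothetical value for the demo)

// Stand-in for a function the optimizer cannot see into.
// Nothing stops such a function from writing to a global.
void opaque_call() {
    --N;
}

// Counts how many times the body runs; N has to be re-read on every test.
int count_iterations() {
    int count = 0;
    for (int i = 0; i < N; ++i) {
        opaque_call();
        ++count;
    }
    return count;
}
```

Starting from N = 10, the body runs 5 times rather than 10, because each call lowers the bound. A compiler that cached N in a register across the call would get this wrong, which is why it has to reload unless it can prove the callee leaves N alone.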

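The compiler can flatten (unroll) the loop: completely (one flat block of 10000 lines), partially (a loop of length 100 with 100 lines in each pass), not at all (a loop of length 10000 with 1 line in each pass), or anything in between.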
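
Partial unrolling, the middle option above, can be written out by hand as a sketch. The function step is a hypothetical stand-in for the loop body, and the factor of 4 is arbitrary; a compiler picks its own factor, and it can only do this safely when it knows the trip count will not change mid-loop.

```cpp
#include <cstdint>

// Hypothetical per-iteration work, standing in for the loop body.
std::uint64_t step(std::uint64_t acc, int x) {
    return acc * 31u + static_cast<std::uint64_t>(x);
}

// One body per pass: n passes.
std::uint64_t rolled(int n) {
    std::uint64_t acc = 0;
    for (int i = 0; i < n; ++i) acc = step(acc, i);
    return acc;
}

// Four bodies per pass, then a short loop for the leftover iterations.
std::uint64_t unrolled_by_4(int n) {
    std::uint64_t acc = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        acc = step(acc, i);
        acc = step(acc, i + 1);
        acc = step(acc, i + 2);
        acc = step(acc, i + 3);
    }
    for (; i < n; ++i) acc = step(acc, i);  // remainder when n is not a multiple of 4
    return acc;
}
```

Both versions perform the same steps in the same order, so they return the same value; the unrolled one simply tests the loop condition less often.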

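The compiler knows a lot more about your variables when they are local, because when they are global it has very little idea of how they change or who reads them. So few assumptions can be made.

Amusingly, this is also why globals are hard for humans to reason about.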

Answer 5

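I think this could be a reason: since global variables are stored in heap memory, your code needs to access heap memory each time. Maybe that is why the code runs slowly.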

Global variables slow down code