Question

我正在编译一个C ++库，它定义了一个从一组数据点中随机采样的函数。数据点存储在std::vector中。有126,272个std::vector push_back语句，其中有问题的向量是double类型。编译需要很长时间。

为什么这需要这么长时间？（除了std::vector push_back语句之外的所有代码都需要不到1秒的时间来编译，因为其他代码很少。）

Answer 1

gcc中有-ftime-report选项，用于打印每个编译阶段浪费的详细时间报告。

我使用ubuntu 12.04 64位和gcc 4.6.3以及此代码重现您的情况：

#include <vector>
using namespace std;

int main()
{
  vector<double> d;

 d.push_back(5.7862517058766);
/* ... N lines generated with 
  perl -e 'print(" d.push_back(",rand(10),");\n") for 1..100000'
*/
 d.push_back(3.77195464257674);

  return d.size();
}

由于PC上的后台加载，各种N（[{1}}时间的-ftime-report输出不准确，因此请查看wall，user time）：

N = 10000

usr

N = 100000

$ g++ -ftime-report ./pb10k.cpp

Execution times (seconds)
...
 expand vars           :   1.48 (47%) usr   0.01 ( 7%) sys   1.49 (44%) wall    1542 kB ( 2%) ggc
 expand                :   0.11 ( 3%) usr   0.01 ( 7%) sys   0.10 ( 3%) wall   19187 kB (30%) ggc
...
 TOTAL                 :   3.18             0.15             3.35              64458 kB

因此，在“扩展变量”阶段，对于巨大的N还有一些额外的工作。此阶段恰好在此行中：cfgexpand.c:4463（在TV_VAR_EXPAND宏之间）。

有趣的事实：我使用自定义编译的32位g ++ 4.6.2编译时间非常短（N = 100000时约为20秒）。

我的g ++和ubuntu g ++有什么区别？一个是Ubuntu中的turning on by default Gcc堆栈保护（$ g++ -ftime-report ./pb100k.cpp Execution times (seconds) .... preprocessing : 0.49 ( 0%) usr 0.28 ( 5%) sys 0.59 ( 0%) wall 6409 kB ( 1%) ggc parser : 0.96 ( 0%) usr 0.39 ( 6%) sys 1.41 ( 0%) wall 108217 kB (18%) ggc name lookup : 0.06 ( 0%) usr 0.07 ( 1%) sys 0.24 ( 0%) wall 1023 kB ( 0%) ggc inline heuristics : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 0 kB ( 0%) ggc integration : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 4095 kB ( 1%) ggc tree gimplify : 0.22 ( 0%) usr 0.00 ( 0%) sys 0.23 ( 0%) wall 36068 kB ( 6%) ggc tree eh : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.14 ( 0%) wall 5678 kB ( 1%) ggc tree CFG construction : 0.08 ( 0%) usr 0.01 ( 0%) sys 0.10 ( 0%) wall 38544 kB ( 7%) ggc .... expand vars : 715.98 (97%) usr 1.62 (27%) sys 718.32 (83%) wall 18359 kB ( 3%) ggc expand : 1.04 ( 0%) usr 0.09 ( 1%) sys 1.64 ( 0%) wall 190836 kB (33%) ggc post expand cleanups : 0.09 ( 0%) usr 0.01 ( 0%) sys 0.15 ( 0%) wall 43 kB ( 0%) ggc .... rest of compilation : 1.94 ( 0%) usr 2.56 (43%) sys 102.42 (12%) wall 63620 kB (11%) ggc TOTAL : 739.68 6.01 866.46 586293 kB选项）。此保护仅添加到“扩展变量”阶段（在来源cfgexpand.c:1644,expand_used_vars()中找到;提到here）：

N = 100000，使用选项 -fstack-protect 禁用堆栈保护程序（将其用于您的代码）：

-fno-stack-protector

运行时间为24秒，低于800.

更新：

在$ g++ -ftime-report -fno-stack-protector pb100k.cpp 2>&1 |egrep 'TOTAL|expand vars' expand vars : 0.08 ( 0%) usr 0.01 ( 1%) sys 0.09 ( 0%) wall 18359 kB ( 3%) ggc TOTAL : 23.05 1.48 24.60 586293 kB内启动gcc（来自Valgrind的调用图分析工具）之后，我可以说有N个堆栈变量。如果启用了堆栈保护程序，则会在“扩展变量”阶段使用三个O（N ^ 2）算法处理它们。实际上有N ^ 2个成功的冲突检测和1.5 * N ^ 2位操作以及一些嵌套循环逻辑。

为什么堆栈变量的数量如此之高？因为代码中的每个双常量都保存到堆栈中的不同插槽中。然后它从它的槽加载并按照调用约定传递（通过x86中的堆栈顶部;通过x86_64中的寄存器）。有趣的事实是：使用callgrind或push_back编译的-fstack-protector的所有代码都是相同的;常量的堆栈布局也是一样的。只有一些非push_back代码的堆栈布局偏移受到影响（使用-fno-stack-protector和-S检查两次运行）。启用的堆栈保护程序没有创建其他代码。

启用堆栈保护程序会致命地更改编译器中的某些行为。不能说确切的位置（注意：有可能通过Juan M. Bello Rivas将堆栈跟踪与callgraph.tar.gz进行比较来找到这个转折点。）

第一个大的N *（N + 1）/ 2 = O（N ^ 2）walk在diff -u函数中设置有关堆栈变量对之间冲突的信息：

expand_used_vars_for_block (tree block, level)

/* Since we do not track exact variable lifetimes (which is not even possible for variables whose address escapes), we mirror the block tree in the interference graph. Here we cause all variables at this level, and all sublevels, to conflict. */ if (old_sv_num < this_sv_num) { new_sv_num = stack_vars_num; for (i = old_sv_num; i < new_sv_num; ++i) for (j = i < this_sv_num ? i : this_sv_num; j-- > old_sv_num ;) add_stack_var_conflict (i, j); } }转向

分配（每个变量一次）一个大小为......的位图，动态大小将增长到N位
在我的var的位掩码中设置一个与j
在j'th var的位掩码中设置一个与i

在add_stack_var_conflict(i,j)中有第二次N ^ 2步行。它使用add_alias_set_conflicts对每对进行类型检查。它检查两个变量是否属于同一类型（大多数对是;这是基于类型的别名分析，TBAA）。如果没有，则调用objects_must_conflict_p;这个N ^ 2循环嵌套只有N个这样的调用。

最后一次大走在add_stack_var_conflict函数中，partition_stack_vars()堆栈变量（O（NlogN））和N *（N-1）/ 2 = O（N ^ 2）步行查找所有非冲突对。这是来自cfgexpand.c文件的qsort的伪代码：

partition_stack_vars

函数Sort the objects by size. For each object A { S = size(A) O = 0 loop { Look for the largest non-conflicting object B with size <= S. /* There is a call to stack_var_conflict_p to check for * conflict between 2 vars */ UNION (A, B) offset(B) = O O += size(B) S -= size(B) } }只检查某个第i个变量中是否存在冲突位掩码，并且第j位设置为第j个变量的冲突标志（调用stack_var_conflict_p）。这里真正的坏消息是，callgrind说每次冲突检查都是成功的，并且每对都会跳过UNION逻辑。

因此，O（N ^ 2）位集和O（N ^ 2/2）位检查浪费了大量时间;所有这些工作都没有帮助优化任何事情。是的，这一切都是bitmap_bit_p(i->conflict_mask,j)的一部分，由-O0触发。

UPDATE2：

似乎，转折点是-fstack-protector cfgexpand.c from 4.6，检查堆栈上的立即或延迟分配变量：

expand_one_var

（根据callgrind，expand_one_stack_var仅在快速变体中调用）

启用1110 else if (defer_stack_allocation (var, toplevel)) 1111 add_stack_var (origvar); 1112 else 1113 { 1114 if (really_expand) 1115 expand_one_stack_var (origvar); 1116 return tree_low_cst (DECL_SIZE_UNIT (var), 1); 1117 }时强制执行延迟分配（有时需要重新排序所有堆栈变量）。甚至还有关于某些“二次问题”的评论，这对我们来说太熟悉了：

-fstack-protect

（堆栈分配也在969 /* A subroutine of expand_one_var. VAR is a variable that will be 970 allocated to the local stack frame. Return true if we wish to 971 add VAR to STACK_VARS so that it will be coalesced with other 972 variables. Return false to allocate VAR immediately. 973 974 This function is used to reduce the number of variables considered 975 for coalescing, which reduces the size of the quadratic problem. */ 976 977 static bool 978 defer_stack_allocation (tree var, bool toplevel) 979 { 980 /* If stack protection is enabled, *all* stack variables must be deferred, 981 so that we can re-order the strings to the top of the frame. */ 982 if (flag_stack_protect) 983 return true;及更高版本延迟）

这是一个提交：http://gcc.gnu.org/ml/gcc-patches/2005-05/txt00029.txt，它添加了这个逻辑。

Answer 2

osgx的答案完全回答了这个问题。

可能还有一个方面：push_back() vs初始化列表

当使用100000 push_backs运行上述测试时，我在Debian 6.0.6系统上使用gcc 4.4.6获得以下结果：

$ time g++ -std=c++0x -ftime-report ./pb100k.cc 

Execution times (seconds)
 garbage collection    :   0.55 ( 1%) usr   0.00 ( 0%) sys   0.55 ( 1%) wall       0 kB ( 0%) ggc
 ...
 reload                :  33.95 (58%) usr   0.13 ( 6%) sys  34.14 (56%) wall   65723 kB ( 9%) ggc
 thread pro- & epilogue:   0.66 ( 1%) usr   0.00 ( 0%) sys   0.66 ( 1%) wall      84 kB ( 0%) ggc
 final                 :   1.82 ( 3%) usr   0.01 ( 0%) sys   1.81 ( 3%) wall      21 kB ( 0%) ggc
 TOTAL                 :  58.65             2.13            60.92             737584 kB

real    1m2.804s
user    1m0.348s
sys     0m2.328s

使用初始化列表时，更多更快：

$ cat pbi100k.cc
#include <vector>
using namespace std;

int main()
{
   vector<double> d {
   0.190987822870774,
   /* 100000 lines with doubles generated with:
          perl -e 'print(rand(10),",\n") for 1..100000'
   */
   7.45608614801021};

  return d.size();
}

$ time g++ -std=c++0x -ftime-report ./pbi100k.cc 

Execution times (seconds)
 callgraph construction:   0.02 ( 2%) usr   0.00 ( 0%) sys   0.02 ( 1%) wall      25 kB ( 0%) ggc
 preprocessing         :   0.72 (59%) usr   0.06 (25%) sys   0.80 (54%) wall    8004 kB (12%) ggc
 parser                :   0.24 (20%) usr   0.12 (50%) sys   0.36 (24%) wall   43185 kB (65%) ggc
 name lookup           :   0.01 ( 1%) usr   0.05 (21%) sys   0.03 ( 2%) wall    1447 kB ( 2%) ggc
 tree gimplify         :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 1%) wall     277 kB ( 0%) ggc
 tree find ref. vars   :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      15 kB ( 0%) ggc
 varconst              :   0.19 (15%) usr   0.01 ( 4%) sys   0.20 (14%) wall   11288 kB (17%) ggc
 integrated RA         :   0.02 ( 2%) usr   0.00 ( 0%) sys   0.02 ( 1%) wall      74 kB ( 0%) ggc
 reload                :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      61 kB ( 0%) ggc
 TOTAL                 :   1.23             0.24             1.48              66378 kB

real    0m1.701s
user    0m1.416s
sys     0m0.276s

这大约快30倍！

Answer 3

我相信很长一段时间都与矢量是一个模板有关。编译器需要使用相应的函数重写push_back的每个出现。这就像有许多重载函数，编译需要进行名称修改以解决正确的函数问题。与简单编译非重载函数相比，这是一项额外的工作。

为什么编译超过100,000行的std :: vector :: push_back需要很长时间？

3 个答案: