Question

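While trying to optimize some code, I'm a bit puzzled by the differences between the profiles produced by kcachegrind and gprof. Specifically, if I use gprof (compiling with the -pg switch, etc.), I get this: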

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 89.62      3.71     3.71   204626     0.02     0.02  objR<true>::R_impl(std::vector<coords_t, std::allocator<coords_t> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&) const
  5.56      3.94     0.23 18018180     0.00     0.00  W2(coords_t const&, coords_t const&)
  3.87      4.10     0.16   200202     0.00     0.00  build_matrix(std::vector<coords_t, std::allocator<coords_t> > const&)
  0.24      4.11     0.01   400406     0.00     0.00  std::vector<double, std::allocator<double> >::vector(std::vector<double, std::allocator<double> > const&)
  0.24      4.12     0.01   100000     0.00     0.00  Wrat(std::vector<coords_t, std::allocator<coords_t> > const&, std::vector<coords_t, std::allocator<coords_t> > const&)
  0.24      4.13     0.01        9     1.11     1.11  std::vector<short, std::allocator<short> >* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::vector<short, std::alloca

This seems to suggest that I don't need to look anywhere other than ::R_impl(...).

At the same time, if I compile without the -pg switch and run valgrind --tool=callgrind ./a.out, I get something rather different. Here is a screenshot of the kcachegrind output:

[Screenshot of the kcachegrind output]

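If I interpret this correctly, it seems to suggest that ::R_impl(...) takes only about 50% of the time, while the other half is spent in linear algebra (Wrat(...), eigenvalues and the underlying lapack calls), which sits far down in the gprof profile.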

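I understand that gprof and cachegrind use different techniques, and I wouldn't be bothered if their results differed somewhat. But here they look very different, and I'm at a loss as to how to interpret them. Any ideas or suggestions?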

Answer 1

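You are looking at the wrong column. You have to look at the second column of the kcachegrind output, the one named "self". This is the time spent in the particular subroutine alone, without counting its children. The first column holds the cumulative time (for main it equals 100% of the machine time) and, in my opinion, it is not very informative.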

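Note that the kcachegrind output shows a total time of 53.64 seconds for the process, while the time spent in the subroutine "R_impl" is 46.72 seconds, which is 87% of the total. So gprof and kcachegrind agree almost perfectly.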
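To make the difference between the two columns concrete, here is a minimal sketch in Python. The call tree and its costs are hypothetical (only the function names are borrowed from the question, and who calls whom is an assumption); the final line simply redoes the arithmetic on the two figures quoted above.

```python
# Hypothetical call tree: name -> (self cost, list of callees).
# The costs and the caller/callee relations are invented for illustration;
# they are not read from the screenshot.
CALL_TREE = {
    "main": (1.0, ["Wrat"]),
    "Wrat": (2.0, ["R_impl", "build_matrix"]),
    "R_impl": (45.0, []),
    "build_matrix": (2.0, []),
}


def inclusive(name, tree=CALL_TREE):
    """Inclusive cost: the function's own cost plus everything it calls."""
    self_cost, callees = tree[name]
    return self_cost + sum(inclusive(callee, tree) for callee in callees)


def self_share(self_cost, total_cost):
    """Percentage of the whole run spent in the function's own body."""
    return 100.0 * self_cost / total_cost


if __name__ == "__main__":
    for name, (self_cost, _) in CALL_TREE.items():
        print(f"{name:14s} incl={inclusive(name):5.1f}  self={self_cost:5.1f}")
    # The two figures quoted in the answer: 46.72 out of 53.64.
    print(f"R_impl self share: {self_share(46.72, 53.64):.1f}%")
```

In this toy tree the inclusive cost of main is the whole run by construction, which is why the inclusive column says little about where the time actually goes, whereas the self column singles out R_impl.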

Answer 2

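gprof is an instrumented profiler, callgrind is a sampling profiler. With an instrumented profiler you pay an overhead on every function entry and exit, which can skew the profile, particularly if you have relatively small functions that are called many times. Sampling profilers tend to be more accurate: they slow down overall program execution slightly, but this tends to have the same relative effect on all functions.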
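The skew from per-call overhead can be illustrated with a deliberately crude model in Python. The call counts and self times are copied from the gprof flat profile in the question and are treated here as if they were overhead-free, purely for illustration; the 20 ns per-call cost is an invented figure, not a measurement.

```python
# Call counts and self times (seconds) copied from the gprof flat profile.
CALLS = {"R_impl": 204_626, "W2": 18_018_180, "build_matrix": 200_202}
SELF_SECONDS = {"R_impl": 3.71, "W2": 0.23, "build_matrix": 0.16}


def shares(times):
    """Convert absolute times into percentages of their sum."""
    total = sum(times.values())
    return {name: 100.0 * t / total for name, t in times.items()}


def with_call_overhead(times, calls, overhead_per_call):
    """Add a fixed entry/exit cost to every call of every function."""
    return {
        name: t + calls[name] * overhead_per_call for name, t in times.items()
    }


if __name__ == "__main__":
    # ASSUMPTION: 20 ns per call is a made-up overhead, chosen for illustration.
    skewed = with_call_overhead(SELF_SECONDS, CALLS, 20e-9)
    before, after = shares(SELF_SECONDS), shares(skewed)
    for name in SELF_SECONDS:
        print(f"{name:14s} {before[name]:5.1f}% -> {after[name]:5.1f}%")
```

Under these assumptions the small, frequently called W2 gains share while R_impl loses some, which is the kind of distortion described above; the real size of the effect depends on the actual per-call cost.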

Try the free 30-day evaluation of Zoom from RotateRight; I suspect it will give you a profile that agrees more with callgrind than with gprof.

gprof vs cachegrind profiles
