Understanding Linux perf report output

Question

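Though I can intuitively make sense of most of the results, I have a hard time fully understanding the output of the perf report command, especially where the call graph is concerned, so I wrote a stupid test to settle this issue once and for all.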

The stupid test

I compiled what follows with:

gcc -Wall -pedantic -lm perf-test.c -o perf-test

No aggressive optimizations, to avoid inlining and such.

#include <math.h>

#define N 10000000UL

#define USELESSNESS(n)                          \
    do {                                        \
        unsigned long i;                        \
        double x = 42;                          \
        for (i = 0; i < (n); i++) x = sin(x);   \
    } while (0)

void baz()
{
    USELESSNESS(N);
}

void bar()
{
    USELESSNESS(2 * N);
    baz();
}

void foo()
{
    USELESSNESS(3 * N);
    bar();
    baz();
}

int main()
{
    foo();
    return 0;
}

Flat profiling

perf record ./perf-test
perf report

With these I get:

  94,44%  perf-test  libm-2.19.so       [.] __sin_sse2
   2,09%  perf-test  perf-test          [.] sin@plt
   1,24%  perf-test  perf-test          [.] foo
   0,85%  perf-test  perf-test          [.] baz
   0,83%  perf-test  perf-test          [.] bar

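That sounds reasonable, since the heavy work is actually performed by __sin_sse2, and sin@plt is probably just a wrapper, while the overhead of my own functions accounts only for the loop. Overall: 3*N iterations for foo, and 2*N for each of the other two (baz runs N per call and is called twice).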
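As a rough sanity check on those numbers, the loop counts can be tallied by hand. The sketch below assumes that each function's self time is proportional to the number of loop iterations it runs itself. The names `loop_share` and `print_expected_split` are hypothetical helpers for illustration and are not part of the test program.

```c
#include <stdio.h>

/* Loop iterations executed directly inside each function, in units of N.
 * baz runs N per call and is called twice (once from bar, once from foo). */
enum { FOO_ITERS = 3, BAR_ITERS = 2, BAZ_ITERS = 1 * 2 };

/* Share (in percent) of all loop iterations that run inside one function. */
double loop_share(int iters)
{
    int total = FOO_ITERS + BAR_ITERS + BAZ_ITERS;   /* 7 * N overall */
    return 100.0 * iters / total;
}

/* Print the expected split; hypothetical helper, call it from main(). */
void print_expected_split(void)
{
    printf("foo: %.1f%%\n", loop_share(FOO_ITERS));
    printf("bar: %.1f%%\n", loop_share(BAR_ITERS));
    printf("baz: %.1f%%\n", loop_share(BAZ_ITERS));
}
```

Under that assumption the expected ratio is 3 : 2 : 2. The measured self overheads of foo, baz and bar (1,24 : 0,85 : 0,83) split roughly 42 : 29 : 28, which is close to it.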

Hierarchical profiling

perf record -g ./perf-test
perf report -G
perf report

Now I get two overhead columns: Children (the output is sorted by this one by default) and Self (the same overhead as in the flat profile).
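To make the two columns concrete, here is a small model of the test program, not of perf itself. It assumes a complete unwind of every sample and counts loop iterations (in units of N) per call chain, with `sin` standing in for __sin_sse2 as the innermost frame. The names `struct chain`, `self_iters` and `children_iters` are hypothetical. Self counts only the chains whose innermost frame is the symbol; Children counts every chain in which the symbol appears.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 5

/* One call chain of the test program and the loop iterations (in units of N)
 * executed while that exact chain is on the stack. */
struct chain {
    const char *frames[MAX_DEPTH];  /* outermost caller first, NULL-padded */
    int iters;
};

static const struct chain chains[] = {
    { { "main", "foo", "sin", NULL,  NULL  }, 3 },  /* foo's own loop      */
    { { "main", "foo", "bar", "sin", NULL  }, 2 },  /* bar's own loop      */
    { { "main", "foo", "bar", "baz", "sin" }, 1 },  /* baz called from bar */
    { { "main", "foo", "baz", "sin", NULL  }, 1 },  /* baz called from foo */
};
#define NCHAINS (sizeof chains / sizeof chains[0])

/* Self: iterations of the chains whose innermost frame is sym. */
int self_iters(const char *sym)
{
    int sum = 0;
    for (size_t c = 0; c < NCHAINS; c++) {
        const char *leaf = NULL;
        for (int d = 0; d < MAX_DEPTH && chains[c].frames[d]; d++)
            leaf = chains[c].frames[d];
        if (leaf && strcmp(leaf, sym) == 0)
            sum += chains[c].iters;
    }
    return sum;
}

/* Children: iterations of the chains in which sym appears anywhere. */
int children_iters(const char *sym)
{
    int sum = 0;
    for (size_t c = 0; c < NCHAINS; c++) {
        for (int d = 0; d < MAX_DEPTH && chains[c].frames[d]; d++) {
            if (strcmp(chains[c].frames[d], sym) == 0) {
                sum += chains[c].iters;
                break;  /* count each chain only once */
            }
        }
    }
    return sum;
}
```

In this idealized model, `children_iters("main")` and `children_iters("foo")` both cover all 7 units, `children_iters("bar")` covers 3 and `children_iters("baz")` covers 2, while only `sin` has a non-zero self count. Measured values can be lower when perf cannot unwind every sample completely.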

Here is where I start to feel I am missing something: whether or not I use -G, I am unable to explain the hierarchy in terms of "x calls y" or "y is called by x". For example:

Without -G ("y is called by x"):
```
-   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
   - __sin_sse2
      + 43,67% foo
      + 41,45% main
      + 14,88% bar
-   37,73%     0,00%  perf-test  perf-test          [.] main
     main
     __libc_start_main
-   23,41%     1,35%  perf-test  perf-test          [.] foo
     foo
     main
     __libc_start_main
-    6,43%     0,83%  perf-test  perf-test          [.] bar
     bar
     foo
     main
     __libc_start_main
-    0,98%     0,98%  perf-test  perf-test          [.] baz
   - baz
      + 54,71% foo
      + 45,29% bar
```
1. Why is __sin_sse2 called by main (indirectly?), foo and bar, but not by baz?
2. Why do functions sometimes have a percentage and a hierarchy attached (e.g., the last instance of baz) and sometimes not (e.g., the last instance of bar)?

With -G ("x calls y"):
```
-   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
   + __sin_sse2
   + __libc_start_main
   + main
-   37,73%     0,00%  perf-test  perf-test          [.] main
   - main
      + 62,05% foo
      + 35,73% __sin_sse2
        2,23% sin@plt
-   23,41%     1,35%  perf-test  perf-test          [.] foo
   - foo
      + 64,40% __sin_sse2
      + 29,18% bar
      + 3,98% sin@plt
        2,44% baz
     __libc_start_main
     main
     foo
```
1. How should I interpret the first three entries under __sin_sse2?
2. main calls foo and that's fine, but why, if it calls __sin_sse2 and sin@plt (indirectly?), does it not also call bar and baz?
3. Why do __libc_start_main and main appear under foo? And why does foo appear twice?

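My suspicion is that this hierarchy has two levels, the second of which actually represents the "x calls y"/"y is called by x" semantics, but I am tired of guessing, so I am asking here. The documentation doesn't seem to help.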

Sorry for the long post, but I hope all this context may help someone else too, or serve as a reference.

Answer 1

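Alright, let's temporarily ignore the difference between caller and callee call graphs, mostly because when I compare the results of these two options on my machine, I only see effects inside the kernel.kallsyms DSO, for reasons I don't understand. I'm relatively new to this myself.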

I found that, for your example, it's a little easier to read the whole tree. So, using --stdio, let's look at the whole tree for __sin_sse2:

# Overhead    Command      Shared Object                  Symbol
# ........  .........  .................  ......................
#
    94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
            |
            --- __sin_sse2
               |
               |--44.20%-- foo
               |          |
               |           --100.00%-- main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
               |--27.95%-- baz
               |          |
               |          |--51.78%-- bar
               |          |          foo
               |          |          main
               |          |          __libc_start_main
               |          |          _start
               |          |          0x0
               |          |
               |           --48.22%-- foo
               |                     main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
                --27.84%-- bar
                          |
                           --100.00%-- foo
                                     main
                                     __libc_start_main
                                     _start
                                     0x0

So the way I read this is: 44% of the time, sin is called from foo; 27% of the time it is called from baz, and 27% of the time from bar.

The documentation for -g is instructive:

 -g [type,min[,limit],order[,key]], --call-graph
       Display call chains using type, min percent threshold, optional print limit and order. type can be either:

       ·   flat: single column, linear exposure of call chains.

       ·   graph: use a graph tree, displaying absolute overhead rates.

       ·   fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.

               order can be either:
               - callee: callee based call graph.
               - caller: inverted caller based call graph.

               key can be:
               - function: compare on functions
               - address: compare on individual code addresses

               Default: fractal,0.5,callee,function.

The important bit here is that the default is fractal, and in fractal mode each branch is a new object.

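So you can see that about 50% of the time that baz is called, it is called from bar (51.78% in the tree above), and the other 50% of the time it is called from foo (48.22%).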

This isn't always the most useful measure, so it's instructive to look at the results using -g graph:

94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
        |
        --- __sin_sse2
           |
           |--41.87%-- foo
           |          |
           |           --41.48%-- main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
           |--26.48%-- baz
           |          |
           |          |--13.50%-- bar
           |          |          foo
           |          |          main
           |          |          __libc_start_main
           |          |          _start
           |          |          0x0
           |          |
           |           --12.57%-- foo
           |                     main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
            --26.38%-- bar
                      |
                       --26.17%-- foo
                                 main
                                 __libc_start_main
                                 _start
                                 0x0

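This switches to absolute percentages, where each percentage is the share of total time attributed to that call chain. So foo->bar is 26% of the total ticks (bar in turn calls baz), and foo->baz (direct) is 12% of the total ticks.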
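The relationship between the two listings can be checked with a little arithmetic. The sketch below assumes that a first-level fractal percentage is relative to its parent entry, so scaling it by the parent's absolute overhead should give the figure that -g graph prints. The names `fractal_to_absolute` and `print_first_level` are hypothetical helpers, and the input numbers are copied from the fractal output above.

```c
#include <stdio.h>

/* Convert a fractal-mode (relative) percentage into an absolute one by
 * scaling it with the absolute percentage of its parent entry. */
double fractal_to_absolute(double parent_abs_pct, double relative_pct)
{
    return parent_abs_pct * relative_pct / 100.0;
}

/* Rescale the first-level callers of __sin_sse2 from the fractal output.
 * Hypothetical helper for illustration; call it from main(). */
void print_first_level(void)
{
    const double sin_total = 94.72;                  /* __sin_sse2 overhead */
    const char *callers[] = { "foo", "baz", "bar" };
    const double relative[] = { 44.20, 27.95, 27.84 };

    for (int i = 0; i < 3; i++)
        printf("%s: %.2f%%\n", callers[i],
               fractal_to_absolute(sin_total, relative[i]));
}
```

For foo, baz and bar this gives about 41.87%, 26.47% and 26.37%, within rounding of the 41.87%, 26.48% and 26.38% in the graph output. Deeper levels do not line up as neatly in these two listings (for example, baz called from bar), so treat the multiplication as an approximation there.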

I still have no idea why I don't see any differences between the callee and caller graphs from the perspective of __sin_sse2, though.

Update

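One thing I did change from your command line is how the call graphs were gathered. By default, Linux perf reconstructs call stacks using the frame pointer method. This can be a problem when the compiler uses -fomit-frame-pointer as a default. So I used

perf record --call-graph dwarf ./perf-test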
