How to decrease the time spent on one instruction?

Time: 2019-04-15 15:08:02

Tags: c optimization x86-64 perf

I am trying to optimize code in C, and one instruction seems to take about 22% of the time.

The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g -Wall -Wextra -Weffc++ -pthread -lrt.

    509529.517218      task-clock (msec)         #    0.999 CPUs utilized
            6,234      context-switches          #    0.012 K/sec
               10      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,985,640,853,831      cycles                    #    3.897 GHz                      (30.76%)
1,897,574,410,921      instructions              #    0.96  insn per cycle           (38.46%)
  229,365,727,020      branches                  #  450.152 M/sec                    (38.46%)
   13,027,677,754      branch-misses             #    5.68% of all branches          (38.46%)
  604,340,619,317      L1-dcache-loads           # 1186.076 M/sec                    (38.46%)
   47,749,307,910      L1-dcache-load-misses     #    7.90% of all L1-dcache hits    (38.47%)
   19,724,956,845      LLC-loads                 #   38.712 M/sec                    (30.78%)
    3,349,412,068      LLC-load-misses           #   16.98% of all LL-cache hits     (30.77%)
  <not supported>      L1-icache-loads                                          
      129,878,634      L1-icache-load-misses                                         (30.77%)
  604,482,046,140      dTLB-loads                # 1186.353 M/sec                    (30.77%)
    4,596,384,416      dTLB-load-misses          #    0.76% of all dTLB cache hits   (30.77%)
        2,493,696      iTLB-loads                #    0.005 M/sec                    (30.77%)
       21,356,368      iTLB-load-misses          #  856.41% of all iTLB cache hits   (30.76%)
  <not supported>      L1-dcache-prefetches                                     
  <not supported>      L1-dcache-prefetch-misses                                

    509.843595752 seconds time elapsed

    507.706093000 seconds user
      1.839848000 seconds sys

VTune Amplifier points me to one function: https://pasteboard.co/IagrLaF.png

The cmpq instruction seems to take 22% of the whole time. In contrast, the other instructions take a negligible amount of time.

perf gives me a somewhat different picture, but I think the results are consistent:

 Percent│       bool mapFound = false;
   0.00 │       movb   $0x0,0x7(%rsp)
        │     goDownBwt():
        │       bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
   0.00 │       lea    0x20(%rsp),%r12
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.00 │       lea    (%rax,%rax,2),%rax
   0.00 │       shl    $0x3,%rax
   0.00 │       mov    %rax,0x18(%rsp)
   0.01 │       movzwl %dx,%eax
   0.00 │       mov    %eax,(%rsp)
   0.00 │     ↓ jmp    d6
        │       nop
        │       if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
   0.30 │ 88:   mov    (%rax),%rsi
   8.38 │       mov    0x10(%rsi),%rcx
   0.62 │       test   %rcx,%rcx
   0.15 │     ↓ je     1b0
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   2.05 │       add    0x18(%rsp),%rcx
        │         ++stats->nDownPreprocessed;
   0.25 │       addq   $0x1,0x18(%rdx)
        │         newState->trace                = PREPROCESSED;
   0.98 │       movb   $0x10,0x30(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  43.36 │       mov    0x8(%rcx),%rax
   2.61 │       cmp    %rax,(%rcx)
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.05 │       mov    %rcx,0x20(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
   3.47 │       setbe  %dl

The function is:

inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
  ++stats->nDown;
  if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
    ++stats->nDownPreprocessed;
    newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
    newState->trace                = PREPROCESSED;
    return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  }
  bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
  newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
  newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
  newState->trace = 0;
  return (newState->interval.k <= newState->interval.l);
}

state_t is defined as:

struct state_t {
  union {
    bwtinterval_t interval;
    preprocessedInterval_t *preprocessedInterval;
  };
  unsigned char trace;
  struct state_t *previousState;
};

and preprocessedInterval_t is:

struct preprocessedInterval_t {
  bwtinterval_t interval;
  preprocessedInterval_t *firstChild;
};

There are few (~1000) state_t structures. However, there are many (350k) preprocessedInterval_t objects, allocated somewhere else.
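As a rough idea of the memory footprint, here is a minimal sizeof sketch; it assumes bwtinterval_t holds two 64-bit bounds k and l (that definition is not shown here, so treat the numbers as an estimate only):

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed layout: bwtinterval_t as two 64-bit bounds (not shown in the post). */
    typedef struct { uint64_t k, l; } bwtinterval_t;

    typedef struct preprocessedInterval_t preprocessedInterval_t;
    struct preprocessedInterval_t {
      bwtinterval_t interval;
      preprocessedInterval_t *firstChild;
    };

    int main (void) {
      /* 24 bytes per node on x86-64, so ~350k nodes is roughly 8 MiB,
         i.e. on the order of a typical L3 cache. */
      printf("sizeof(preprocessedInterval_t) = %zu\n", sizeof(preprocessedInterval_t));
      printf("~350k nodes = %zu MiB\n",
             (size_t) 350000 * sizeof(preprocessedInterval_t) / (1024 * 1024));
      return 0;
    }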

The first if is indeed true 15 billion times out of 19 billion.

Looking for the mispredicted branches with perf record -e branches,branch-misses mytool gives me, for this function:

Available samples
2M branches                                                                                                                                                                                                       
1M branch-misses  

Can I assume that branch misprediction is responsible for this slowdown? What would be the next step to optimize my code?

The code is available on GitHub.

Edit 1

valgrind --tool=cachegrind gives me:

I   refs:      1,893,716,274,393
I1  misses:            4,702,494
LLi misses:              137,142
I1  miss rate:              0.00%
LLi miss rate:              0.00%

D   refs:        756,774,557,235  (602,597,601,611 rd   + 154,176,955,624 wr)
D1  misses:       39,489,866,187  ( 33,583,272,379 rd   +   5,906,593,808 wr)
LLd misses:        3,483,920,786  (  3,379,118,877 rd   +     104,801,909 wr)
D1  miss rate:               5.2% (            5.6%     +             3.8%  )
LLd miss rate:               0.5% (            0.6%     +             0.1%  )

LL refs:          39,494,568,681  ( 33,587,974,873 rd   +   5,906,593,808 wr)
LL misses:         3,484,057,928  (  3,379,256,019 rd   +     104,801,909 wr)
LL miss rate:                0.1% (            0.1%     +             0.1%  )

Edit 2

I compiled with -O3 -DNDEBUG -march=native -fprofile-use and used the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out

    508322.348021      task-clock (msec)         #    0.998 CPUs utilized
           21,592      context-switches          #    0.042 K/sec
               33      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,978,382,746,597      cycles                    #    3.892 GHz                      (44.44%)
  228,898,532,311      branches                  #  450.302 M/sec                    (44.45%)
   12,816,920,039      branch-misses             #    5.60% of all branches          (44.45%)
1,867,947,557,739      instructions              #    0.94  insn per cycle           (55.56%)
2,957,085,686,275      uops_issued.any           # 5817.343 M/sec                    (55.56%)
2,864,257,274,102      uops_executed.thread      # 5634.726 M/sec                    (55.56%)
    2,490,571,629      mem_load_uops_retired.l3_miss #    4.900 M/sec                    (55.55%)
   12,482,683,638      mem_load_uops_retired.l2_miss #   24.557 M/sec                    (55.55%)
   18,634,558,602      mem_load_uops_retired.l1_miss #   36.659 M/sec                    (44.44%)

    509.210162391 seconds time elapsed

    506.213075000 seconds user
      2.147749000 seconds sys

Edit 3

I selected the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mention my function:

Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead  Command     Shared Object       Symbol
  49.76%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead  Command     Shared Object       Symbol
  49.36%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead  Command     Shared Object       Symbol
  43.01%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead  Command     Shared Object       Symbol
  50.32%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead  Command     Shared Object       Symbol
  34.85%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead  Command     Shared Object       Symbol
  42.19%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead  Command     Shared Object       Symbol
  48.14%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead  Command     Shared Object      Symbol
  33.24%  srnaMapper  srnaMapper         [.] addState
  31.00%  srnaMapper  srnaMapper         [.] mapWithoutError

Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead  Command     Shared Object       Symbol
  40.79%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead  Command     Shared Object       Symbol
  39.11%  srnaMapper  srnaMapper          [.] mapWithoutError

(Using perf record --period 10000 triggers Workload failed: No such file or directory.)

1 Answer:

Answer 0 (score: 4)

Are branches and branch-misses really the same? A 50% mispredict rate would be extremely bad.

https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter, so that events fire often enough to get enough samples even for rare events, but you can set the period (how many raw counts trigger a sample). perf record --period 10000 does that, but I haven't used it.

Use perf stat to get hard numbers. Update: yes, your perf stat results confirm that your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That is still higher than you would like (branches are usually frequent and mispredicts are expensive), but it is not insane.


Also look at the L1d cache miss rate, and maybe at mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see whether it is really the loads that are missing. e.g.

perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss  ./a.out

You can use any of these events with perf record to get some statistical samples of which instructions are causing the cache misses. These are precise events (using PEBS), so they should map accurately to the right instruction (unlike "cycles", where counts get attributed to some nearby instruction, often one that is stalled waiting for an input with the ROB full, rather than the instruction that was slow to produce that input).

And they do so without the skid of non-PEBS events, which still have to "blame" a single instruction but don't always interrupt at exactly the right place.


If you are optimizing for your local machine and don't need the binary to run anywhere else, you can use -O3 -march=native. Not that this will help with the cache misses, though.

GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run with some realistic input data to generate profile output / gcc -O3 -march=native -fprofile-use)



Can I assume that branch misprediction is responsible for this slowdown?

No, cache misses are more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.
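A rough back-of-the-envelope with the counters from Edit 2 illustrates this. The per-event penalties below (~15 cycles per mispredict, ~200 cycles per load that goes all the way to DRAM) are assumed round numbers, and real costs overlap with other work, so treat the results as loose upper bounds rather than exact attributions:

    #include <stdio.h>

    int main (void) {
      const double cycles       = 1.978e12; /* total cycles (Edit 2) */
      const double branchMisses = 1.28e10;  /* branch-misses (Edit 2) */
      const double l3Misses     = 2.49e9;   /* mem_load_uops_retired.l3_miss (Edit 2) */

      /* Assumed penalties: ~15 cycles per mispredict, ~200 cycles per DRAM access. */
      printf("mispredicts: up to ~%.0f%% of cycles\n", 100.0 * branchMisses * 15.0  / cycles);
      printf("DRAM loads:  up to ~%.0f%% of cycles\n", 100.0 * l3Misses    * 200.0 / cycles);
      return 0;
    }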


What would be the next step to optimize my code?

Compact your data structures if possible, so that more of them fit in cache, e.g. use 32-bit pointers (the Linux x32 ABI: gcc -mx32) if you don't need more than 4 GiB of virtual address space. Or maybe try 32-bit unsigned indices into a big array instead of raw pointers, but that gives slightly worse load-use latency (by a couple of cycles on Sandybridge-family).
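A minimal sketch of the index-instead-of-pointer idea (all names here are hypothetical, and it only pays off if the node count and, ideally, the interval bounds really fit in 32 bits):

    #include <stdint.h>

    /* Hypothetical packed node: a 32-bit index into one flat array replaces the
       64-bit firstChild pointer. If the interval bounds also fit in 32 bits,
       the node likely shrinks from 24 to 12 bytes, halving the working set. */
    typedef struct {
      uint32_t k, l;        /* interval bounds, only valid if they fit in 32 bits */
      uint32_t firstChild;  /* index into intervalPool; UINT32_MAX = no child */
    } packedInterval_t;

    static packedInterval_t *intervalPool; /* one contiguous allocation */

    static inline packedInterval_t *getChild (const packedInterval_t *node,
                                              unsigned short nucleotide) {
      return &intervalPool[node->firstChild + nucleotide];
    }

The extra address arithmetic on every access is where the slightly worse load-use latency mentioned above comes from.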

And/or improve your access pattern, so that you mostly access them in sequential order. Then hardware prefetch can bring them into cache before you need to read them.
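One way to sketch that (a hypothetical bump allocator, not something from the repository): carve all the nodes out of one contiguous block, in roughly the order they are built and visited, so that sibling groups share cache lines and nearby lines get prefetched instead of chasing scattered malloc()ed blocks.

    #include <stddef.h>

    /* Hypothetical bump allocator: all nodes come from one contiguous block. */
    typedef struct {
      unsigned char *base;
      size_t         used;
      size_t         capacity;
    } arena_t;

    static inline void *arenaAlloc (arena_t *arena, size_t size) {
      size = (size + 7) & ~(size_t) 7;                 /* keep 8-byte alignment */
      if (arena->used + size > arena->capacity) return NULL;
      void *p = arena->base + arena->used;
      arena->used += size;
      return p;
    }

    /* e.g. allocate the four children of a node in one contiguous slab, so that
       firstChild + nucleotide stays a simple offset and the group shares cache lines:
         preprocessedInterval_t *children = arenaAlloc(&pool, 4 * sizeof *children); */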

I am not familiar enough with https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform or its application to sequence alignment to know whether it can be made more efficient, but data compression is inherently problematic because you very often need data-dependent branching and access to scattered data. It is often still worth the tradeoff against even more cache misses, though.