I am trying to optimize a piece of C code, and one instruction seems to take about 22% of the time.
The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g, plus -Wall -Wextra -Weffc++ -pthread -lrt.
509529.517218 task-clock (msec) # 0.999 CPUs utilized
6,234 context-switches # 0.012 K/sec
10 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,985,640,853,831 cycles # 3.897 GHz (30.76%)
1,897,574,410,921 instructions # 0.96 insn per cycle (38.46%)
229,365,727,020 branches # 450.152 M/sec (38.46%)
13,027,677,754 branch-misses # 5.68% of all branches (38.46%)
604,340,619,317 L1-dcache-loads # 1186.076 M/sec (38.46%)
47,749,307,910 L1-dcache-load-misses # 7.90% of all L1-dcache hits (38.47%)
19,724,956,845 LLC-loads # 38.712 M/sec (30.78%)
3,349,412,068 LLC-load-misses # 16.98% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads
129,878,634 L1-icache-load-misses (30.77%)
604,482,046,140 dTLB-loads # 1186.353 M/sec (30.77%)
4,596,384,416 dTLB-load-misses # 0.76% of all dTLB cache hits (30.77%)
2,493,696 iTLB-loads # 0.005 M/sec (30.77%)
21,356,368 iTLB-load-misses # 856.41% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
509.843595752 seconds time elapsed
507.706093000 seconds user
1.839848000 seconds sys
VTune Amplifier points me to one function: https://pasteboard.co/IagrLaF.png
The cmpq instruction seems to take 22% of the total time, while the other instructions take negligible time.
perf gives me a slightly different picture, but I think the results are consistent:
Percent│ bool mapFound = false;
0.00 │ movb $0x0,0x7(%rsp)
│ goDownBwt():
│ bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
0.00 │ lea 0x20(%rsp),%r12
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.00 │ lea (%rax,%rax,2),%rax
0.00 │ shl $0x3,%rax
0.00 │ mov %rax,0x18(%rsp)
0.01 │ movzwl %dx,%eax
0.00 │ mov %eax,(%rsp)
0.00 │ ↓ jmp d6
│ nop
│ if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
0.30 │ 88: mov (%rax),%rsi
8.38 │ mov 0x10(%rsi),%rcx
0.62 │ test %rcx,%rcx
0.15 │ ↓ je 1b0
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
2.05 │ add 0x18(%rsp),%rcx
│ ++stats->nDownPreprocessed;
0.25 │ addq $0x1,0x18(%rdx)
│ newState->trace = PREPROCESSED;
0.98 │ movb $0x10,0x30(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
43.36 │ mov 0x8(%rcx),%rax
2.61 │ cmp %rax,(%rcx)
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.05 │ mov %rcx,0x20(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
3.47 │ setbe %dl
The function is:
inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
++stats->nDown;
if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
++stats->nDownPreprocessed;
newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
newState->trace = PREPROCESSED;
return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
}
bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
newState->trace = 0;
return (newState->interval.k <= newState->interval.l);
}
state_t is defined as:
struct state_t {
union {
bwtinterval_t interval;
preprocessedInterval_t *preprocessedInterval;
};
unsigned char trace;
struct state_t *previousState;
};
and preprocessedInterval_t is:
struct preprocessedInterval_t {
bwtinterval_t interval;
preprocessedInterval_t *firstChild;
};
There are few (~1000) state_t structures, but there are many (350k) preprocessedInterval_t objects, allocated somewhere else.
The first if is true roughly 15 billion times out of 19 billion.
Looking for the mispredicted branches on this function with perf record -e branches,branch-misses mytool gives me:
Available samples
2M branches
1M branch-misses
Can I suppose that branch misprediction is responsible for this slowdown? What would be the next step to optimize my code?
The code is available on GitHub.
Edit 1
valgrind --tool=cachegrind gives me:
I refs: 1,893,716,274,393
I1 misses: 4,702,494
LLi misses: 137,142
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 756,774,557,235 (602,597,601,611 rd + 154,176,955,624 wr)
D1 misses: 39,489,866,187 ( 33,583,272,379 rd + 5,906,593,808 wr)
LLd misses: 3,483,920,786 ( 3,379,118,877 rd + 104,801,909 wr)
D1 miss rate: 5.2% ( 5.6% + 3.8% )
LLd miss rate: 0.5% ( 0.6% + 0.1% )
LL refs: 39,494,568,681 ( 33,587,974,873 rd + 5,906,593,808 wr)
LL misses: 3,484,057,928 ( 3,379,256,019 rd + 104,801,909 wr)
LL miss rate: 0.1% ( 0.1% + 0.1% )
Edit 2
I compiled with -O3 -DNDEBUG -march=native -fprofile-use and used the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out
508322.348021 task-clock (msec) # 0.998 CPUs utilized
21,592 context-switches # 0.042 K/sec
33 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,978,382,746,597 cycles # 3.892 GHz (44.44%)
228,898,532,311 branches # 450.302 M/sec (44.45%)
12,816,920,039 branch-misses # 5.60% of all branches (44.45%)
1,867,947,557,739 instructions # 0.94 insn per cycle (55.56%)
2,957,085,686,275 uops_issued.any # 5817.343 M/sec (55.56%)
2,864,257,274,102 uops_executed.thread # 5634.726 M/sec (55.56%)
2,490,571,629 mem_load_uops_retired.l3_miss # 4.900 M/sec (55.55%)
12,482,683,638 mem_load_uops_retired.l2_miss # 24.557 M/sec (55.55%)
18,634,558,602 mem_load_uops_retired.l1_miss # 36.659 M/sec (44.44%)
509.210162391 seconds time elapsed
506.213075000 seconds user
2.147749000 seconds sys
Edit 3
I selected the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mention my function:
Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead Command Shared Object Symbol
49.76% srnaMapper srnaMapper [.] mapWithoutError
Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead Command Shared Object Symbol
49.36% srnaMapper srnaMapper [.] mapWithoutError
Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead Command Shared Object Symbol
43.01% srnaMapper srnaMapper [.] mapWithoutError
Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead Command Shared Object Symbol
50.32% srnaMapper srnaMapper [.] mapWithoutError
Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead Command Shared Object Symbol
34.85% srnaMapper srnaMapper [.] mapWithoutError
Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead Command Shared Object Symbol
42.19% srnaMapper srnaMapper [.] mapWithoutError
Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead Command Shared Object Symbol
48.14% srnaMapper srnaMapper [.] mapWithoutError
Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead Command Shared Object Symbol
33.24% srnaMapper srnaMapper [.] addState
31.00% srnaMapper srnaMapper [.] mapWithoutError
Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead Command Shared Object Symbol
40.79% srnaMapper srnaMapper [.] mapWithoutError
Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead Command Shared Object Symbol
39.11% srnaMapper srnaMapper [.] mapWithoutError
(Using perf record --period 10000 triggers Workload failed: No such file or directory.)
Answer (score: 4)
Are branches and branch-misses sampled at the same rate? A 50% mispredict rate would be extremely bad.
https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter, so events fire often enough to get enough samples even for rare events, but you can set the period (how many raw counts trigger a sample). That is what perf record --period 10000 does, but I haven't used that.
Use perf stat to get hard numbers. Update: yes, your perf stat results confirm that your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive), but not insane.
Also look at the L1d cache miss rate, and maybe at mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see whether it's really loads that are missing, e.g.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss ./a.out
You can use any of these events with perf record to get some statistical samples of which instructions are causing the cache misses. Those are precise events (using PEBS), so they should map accurately to the right instruction (not like "cycles", where counts get attributed to some nearby instruction, often the one that stalls waiting for an input while the ROB is full, rather than the instruction that was slow to produce it), and without any of the skew of non-PEBS events, which have to "blame" a single instruction but don't always interrupt at exactly the right place.
If you're optimizing for your local machine and don't need the binary to run anywhere else, you could use -O3 -march=native. Not that that will fix the cache misses, though.
GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)
Can I suppose that branch misprediction is responsible for this slowdown?
No, cache misses are more likely. You have a considerable number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that, if the branches predict well.
What would be the next step to optimize my code?
If possible, compact your data structures so more of them fit in cache, e.g. use 32-bit pointers (the Linux x32 ABI: gcc -mx32) if you don't need more than 4 GiB of virtual address space. Or maybe try a 32-bit unsigned index into a large array instead of raw pointers, but that has slightly worse load-use latency (by a couple of cycles on Sandybridge-family).
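As a rough sketch of the index-for-pointer idea (none of this is from the repository: packedInterval_t, pool, getChild, and above all the assumption that the interval bounds of your dataset fit in 32 bits are mine), a 24-byte node (two 64-bit bounds plus a pointer, which is what the ×24 scaling in the disassembly suggests) could shrink to 12 bytes, so roughly twice as many nodes fit per cache line:

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t k;                   /* bounds narrowed from 64 to 32 bits --      */
    uint32_t l;                   /* only valid if the data never overflows      */
} bwtinterval32_t;

typedef struct {
    bwtinterval32_t interval;
    uint32_t firstChild;          /* index into 'pool' instead of a pointer      */
} packedInterval_t;               /* 12 bytes instead of 24                      */

#define NO_CHILD UINT32_MAX       /* sentinel replacing the NULL pointer         */

static packedInterval_t *pool;    /* one contiguous array holding all the nodes  */

/* The four children of a node stay contiguous, so 'firstChild + nucleotide'
 * still works, just as an index rather than pointer arithmetic. */
static inline packedInterval_t *getChild (const packedInterval_t *node,
                                          unsigned short nucleotide) {
    return (node->firstChild == NO_CHILD) ? NULL
                                          : &pool[node->firstChild + nucleotide];
}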
And/or improve your access pattern so you mostly access your data in sequential order, letting hardware prefetch bring it into cache before you need to read it.
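Continuing the same hypothetical pool from the sketch above (and only guessing that the ~350k nodes are currently allocated in many separate calls): carving all nodes out of that single array in creation order keeps siblings adjacent and makes a traversal that roughly follows creation order sequential enough for the hardware prefetcher to help.

/* Sketch only: bump allocation inside the pool declared above. */
static uint32_t poolUsed;
static uint32_t poolCapacity;     /* set once, when 'pool' is allocated          */

/* Reserve the four children of one node as a contiguous block and return the
 * index of the first one (to be stored in the parent's firstChild). */
static inline uint32_t allocChildren (void) {
    if (poolUsed + 4 > poolCapacity)
        return NO_CHILD;          /* pool exhausted                              */
    uint32_t first = poolUsed;
    poolUsed += 4;
    return first;
}

Whether this actually helps depends on how closely the traversal order matches the creation order; it is a sketch of the layout idea, not a drop-in change.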
I'm not familiar enough with https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform or its application to sequence alignment to know whether it can be made more efficient, but data compression is inherently problematic because you very often need data-dependent branching and access to scattered data. It's often worth the tradeoff against even more cache misses, though.