I am running my program in parallel on a server (Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz). The server has 4 cores, each with Hyper-Threading, so there are 8 hardware threads in total.
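I found that when my program's degree of parallelism is less than 4, it gets an almost linear speedup (see the figure on the right). When it is greater than 4, however, the speedup gets much worse. I suspect this is because of the floating-point units: the server has only 4 of them. I would like to explain my experimental results by counting FLOPS (floating-point operations per second). So how can I count FLOPS? Is there any other way to explain this result? Thanks.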
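To be concrete about the two quantities I mean, here is a minimal sketch of the arithmetic. The floating-point operation count is exactly the number I do not know how to obtain, so the values below are made-up placeholders, not measurements, and the function names are only for illustration.

```python
def speedup(t_base: float, t_parallel: float) -> float:
    """Speedup of a parallel run relative to a baseline run (wall-clock seconds)."""
    return t_base / t_parallel


def flops(fp_operations: float, seconds: float) -> float:
    """Floating-point operations per second."""
    return fp_operations / seconds


# Hypothetical numbers, only to show the arithmetic.
example_fp_ops = 2.0e12   # assumed count of floating-point operations
example_seconds = 100.0   # assumed wall-clock time in seconds
example_gflops = flops(example_fp_ops, example_seconds) / 1e9
print(f"{example_gflops:.1f} GFLOPS")
```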
Here is my perf list:
afancy@ubuntu:$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
ref-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
minor-faults [Software event]
major-faults [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
L1-icache-prefetch-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-stores [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-prefetches [Hardware cache event]
dTLB-prefetch-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
node-loads [Hardware cache event]
node-load-misses [Hardware cache event]
node-stores [Hardware cache event]
node-store-misses [Hardware cache event]
node-prefetches [Hardware cache event]
node-prefetch-misses [Hardware cache event]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
(see 'man perf-list' on how to encode it)
mem:<addr>[:access] [Hardware breakpoint]
Here is the output of perf stat matlab -nodesktop -nojvm < main.m:
======================Num. of cores/threads = 2======================
458223.935241 task-clock # 0.999 CPUs utilized
39,038 context-switches # 0.085 K/sec
78 cpu-migrations # 0.000 K/sec
459,290 page-faults # 0.001 M/sec
1,598,967,197,448 cycles # 3.489 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
3,052,651,880,341 instructions # 1.91 insns per cycle
675,069,830,714 branches # 1473.231 M/sec
3,699,587,126 branch-misses # 0.55% of all branches
458.519712953 seconds time elapsed
------------------------------------------------------
472493.757765 task-clock # 0.999 CPUs utilized
40,231 context-switches # 0.085 K/sec
83 cpu-migrations # 0.000 K/sec
454,849 page-faults # 0.963 K/sec
1,648,754,575,728 cycles # 3.489 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
3,050,973,794,286 instructions # 1.85 insns per cycle
674,701,101,539 branches # 1427.958 M/sec
3,854,961,561 branch-misses # 0.57% of all branches
472.810679033 seconds time elapsed
============== Num. of cores/threads = 4 ==========================
233673.870204 task-clock # 0.998 CPUs utilized
20,265 context-switches # 0.087 K/sec
110 cpu-migrations # 0.000 K/sec
248,922 page-faults # 0.001 M/sec
815,466,229,226 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,528,487,784,122 instructions # 1.87 insns per cycle
338,001,335,905 branches # 1446.466 M/sec
1,878,625,642 branch-misses # 0.56% of all branches
234.029335936 seconds time elapsed
---------------------------------------------
231203.147937 task-clock # 0.998 CPUs utilized
20,028 context-switches # 0.087 K/sec
91 cpu-migrations # 0.000 K/sec
249,906 page-faults # 0.001 M/sec
806,862,892,981 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,525,844,491,295 instructions # 1.89 insns per cycle
337,423,026,113 branches # 1459.422 M/sec
1,839,223,079 branch-misses # 0.55% of all branches
231.578239447 seconds time elapsed
-----------------------------------------
233813.938379 task-clock # 0.998 CPUs utilized
20,210 context-switches # 0.086 K/sec
78 cpu-migrations # 0.000 K/sec
246,951 page-faults # 0.001 M/sec
815,974,334,825 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,525,890,625,730 instructions # 1.87 insns per cycle
337,426,244,903 branches # 1443.140 M/sec
1,981,754,037 branch-misses # 0.59% of all branches
234.193620912 seconds time elapsed
-------------------------------------------------
233269.315745 task-clock # 0.998 CPUs utilized
20,202 context-switches # 0.087 K/sec
112 cpu-migrations # 0.000 K/sec
230,240 page-faults # 0.987 K/sec
814,074,094,896 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,526,825,737,326 instructions # 1.88 insns per cycle
337,639,762,266 branches # 1447.425 M/sec
1,852,788,062 branch-misses # 0.55% of all branches
233.642106982 seconds time elapsed
====================== Num. of cores/threads = 6 ================
232682.918326 task-clock # 0.998 CPUs utilized
22,109 context-switches # 0.095 K/sec
96 cpu-migrations # 0.000 K/sec
172,440 page-faults # 0.741 K/sec
811,991,238,956 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,019,407,910,404 instructions # 1.26 insns per cycle
225,426,394,521 branches # 968.814 M/sec
1,344,046,527 branch-misses # 0.60% of all branches
233.124504147 seconds time elapsed
------------------------------------------
210835.066220 task-clock # 0.998 CPUs utilized
18,696 context-switches # 0.089 K/sec
107 cpu-migrations # 0.001 K/sec
173,955 page-faults # 0.825 K/sec
735,764,609,235 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,019,083,429,216 instructions # 1.39 insns per cycle
225,355,627,333 branches # 1068.872 M/sec
1,316,268,293 branch-misses # 0.58% of all branches
211.323109113 seconds time elapsed
---------------------------------------------
179852.029353 task-clock # 0.998 CPUs utilized
15,465 context-switches # 0.086 K/sec
107 cpu-migrations # 0.001 K/sec
172,942 page-faults # 0.962 K/sec
627,644,775,747 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,017,482,864,797 instructions # 1.62 insns per cycle
225,004,972,767 branches # 1251.056 M/sec
1,255,067,791 branch-misses # 0.56% of all branches
180.246118105 seconds time elapsed
---------------------------------------------
219614.665400 task-clock # 0.998 CPUs utilized
21,290 context-switches # 0.097 K/sec
90 cpu-migrations # 0.000 K/sec
170,882 page-faults # 0.778 K/sec
766,392,860,245 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,017,686,212,128 instructions # 1.33 insns per cycle
225,049,868,367 branches # 1024.749 M/sec
1,322,942,620 branch-misses # 0.59% of all branches
220.092311263 seconds time elapsed
----------------------------------------------
176764.084715 task-clock # 0.998 CPUs utilized
15,282 context-switches # 0.086 K/sec
99 cpu-migrations # 0.001 K/sec
168,629 page-faults # 0.954 K/sec
616,874,157,735 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,018,436,813,450 instructions # 1.65 insns per cycle
225,214,699,712 branches # 1274.098 M/sec
1,271,583,320 branch-misses # 0.56% of all branches
177.198129682 seconds time elapsed
======================== Num. of cores/threads = 8 ==================
207252.104133 task-clock # 0.998 CPUs utilized
18,598 context-switches # 0.090 K/sec
99 cpu-migrations # 0.000 K/sec
144,037 page-faults # 0.695 K/sec
723,242,099,542 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
764,758,792,593 instructions # 1.06 insns per cycle
169,108,788,865 branches # 815.957 M/sec
1,068,941,156 branch-misses # 0.63% of all branches
207.729752155 seconds time elapsed
----------------------------------------------
206174.337637 task-clock # 0.998 CPUs utilized
22,188 context-switches # 0.108 K/sec
118 cpu-migrations # 0.001 K/sec
132,956 page-faults # 0.645 K/sec
719,474,677,828 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
765,214,496,607 instructions # 1.06 insns per cycle
169,211,117,316 branches # 820.719 M/sec
1,039,836,842 branch-misses # 0.61% of all branches
206.652707435 seconds time elapsed
----------------------------------------------
205240.082258 task-clock # 0.989 CPUs utilized
44,991 context-switches # 0.219 K/sec
163 cpu-migrations # 0.001 K/sec
136,109 page-faults # 0.663 K/sec
716,133,704,444 cycles # 3.489 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
763,898,836,941 instructions # 1.07 insns per cycle
168,924,070,103 branches # 823.056 M/sec
1,066,021,420 branch-misses # 0.63% of all branches
207.511466061 seconds time elapsed
----------------------------------------------
205016.856849 task-clock # 0.989 CPUs utilized
44,386 context-switches # 0.216 K/sec
180 cpu-migrations # 0.001 K/sec
133,995 page-faults # 0.654 K/sec
715,351,228,880 cycles # 3.489 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
763,637,525,789 instructions # 1.07 insns per cycle
168,860,189,098 branches # 823.641 M/sec
1,056,980,771 branch-misses # 0.63% of all branches
207.231704712 seconds time elapsed
----------------------------------------------
205388.150659 task-clock # 0.998 CPUs utilized
21,328 context-switches # 0.104 K/sec
103 cpu-migrations # 0.001 K/sec
135,843 page-faults # 0.661 K/sec
716,737,227,792 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
764,359,316,365 instructions # 1.07 insns per cycle
169,023,595,573 branches # 822.947 M/sec
1,045,914,789 branch-misses # 0.62% of all branches
205.857635295 seconds time elapsed
----------------------------------------------
207178.729781 task-clock # 0.998 CPUs utilized
17,956 context-switches # 0.087 K/sec
105 cpu-migrations # 0.001 K/sec
137,996 page-faults # 0.666 K/sec
722,998,617,131 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
763,085,695,510 instructions # 1.06 insns per cycle
168,733,709,256 branches # 814.435 M/sec
1,052,517,264 branch-misses # 0.62% of all branches
207.608998891 seconds time elapsed
----------------------------------------------
206701.393252 task-clock # 0.998 CPUs utilized
24,596 context-switches # 0.119 K/sec
137 cpu-migrations # 0.001 K/sec
136,553 page-faults # 0.661 K/sec
721,294,495,478 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
764,246,861,748 instructions # 1.06 insns per cycle
168,997,611,020 branches # 817.593 M/sec
1,050,078,827 branch-misses # 0.62% of all branches
207.206805179 seconds time elapsed
----------------------------------------------
206455.394644 task-clock # 0.997 CPUs utilized
26,089 context-switches # 0.126 K/sec
87 cpu-migrations # 0.000 K/sec
132,658 page-faults # 0.643 K/sec
720,429,194,133 cycles # 3.490 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
764,339,875,802 instructions # 1.06 insns per cycle
169,014,685,081 branches # 818.650 M/sec
1,047,046,966 branch-misses # 0.62% of all branches
206.982094466 seconds time elapsed
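
As a rough cross-check of the numbers above, here is a small sketch that recomputes instructions per cycle (IPC) and the speedup relative to the 2-thread run. It uses only the first perf stat block from each group, which is a simplification: the 6-thread blocks range from about 177 s to about 233 s, so that row is the least representative.

```python
# First perf stat block from each group above:
# thread count -> (cycles, instructions, elapsed seconds).
runs = {
    2: (1_598_967_197_448, 3_052_651_880_341, 458.519712953),
    4: (815_466_229_226, 1_528_487_784_122, 234.029335936),
    6: (811_991_238_956, 1_019_407_910_404, 233.124504147),
    8: (723_242_099_542, 764_758_792_593, 207.729752155),
}

baseline_elapsed = runs[2][2]  # the 2-thread run is the reference point

ipc = {}
relative_speedup = {}
for threads, (cycles, instructions, elapsed) in runs.items():
    ipc[threads] = instructions / cycles
    relative_speedup[threads] = baseline_elapsed / elapsed
    print(f"{threads} threads: IPC = {ipc[threads]:.2f}, "
          f"speedup vs 2 threads = {relative_speedup[threads]:.2f}")
```

In these blocks the IPC stays near 1.9 for 2 and 4 threads and falls to about 1.06 at 8 threads, while going from 4 to 8 threads shortens the elapsed time only slightly.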