Question

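Well, this will be a question without many details, because I do not know how to explain it better. Sorry. I have a memory-intensive C program (lots of pointers). It is a single source file, which I compile myself with gcc -O2. I am on Ubuntu Linux. clock() is called at the beginning and at the end of the program to measure the elapsed time, and I also check the time with the time command. The problem is that the same program is sometimes 20% faster (or slower) without anything being changed.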

$ date; time ./cudd-example-8queens
pon jun 20 00:49:05 CEST 2016
CPU TIME = 6.46
real    0m6.475s
user    0m6.405s
sys 0m0.067s

$ date; time ./cudd-example-8queens
pon jun 20 00:49:16 CEST 2016
CPU TIME = 8.03
real    0m8.051s
user    0m7.995s
sys 0m0.048s

$ date; time ./cudd-example-8queens
pon jun 20 00:49:33 CEST 2016
CPU TIME = 6.48
real    0m6.490s
user    0m6.445s
sys 0m0.040s

$ date; time ./cudd-example-8queens
pon jun 20 00:49:42 CEST 2016
CPU TIME = 6.45
real    0m6.469s
user    0m6.424s
sys 0m0.040s

$ date; time ./cudd-example-8queens
pon jun 20 00:49:56 CEST 2016
CPU TIME = 8.04
real    0m8.058s
user    0m7.982s
sys 0m0.068s

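My question is: how can this difference be explained, i.e. where does the extra 1.5 s (sometimes even more) go? It must be something to do with memory access, but how can I check?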

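EDIT: I have installed perf, and here are two results (I have updated them to also show the information obtained from cpupower). As for the goal: I am comparing scientific algorithms, so it matters to me whether, for example, one is 10% faster than another.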

$ date; cpupower -c all frequency-info -f; perf stat -B ./cudd-example-8queens 
pon jun 20 12:39:21 CEST 2016
analyzing CPU 0:
1300000
analyzing CPU 1:
1300000
analyzing CPU 2:
1300000
analyzing CPU 3:
1300000
clock() TIME = 6.70
clock_gettime() TIME = 6.70

 Performance counter stats for './cudd-example-8queens':

       6705,796274 task-clock (msec)         #    0,999 CPUs utilized          
               104 context-switches          #    0,016 K/sec                  
                 3 cpu-migrations            #    0,000 K/sec                  
             30861 page-faults               #    0,005 M/sec                  
       17295862806 cycles                    #    2,579 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        7361712951 instructions              #    0,43  insns per cycle        
        1228059232 branches                  #  183,134 M/sec                  
          64491733 branch-misses             #    5,25% of all branches        

       6,709414218 seconds time elapsed

$ date; cpupower -c all frequency-info -f; perf stat -B ./cudd-example-8queens 
pon jun 20 12:39:30 CEST 2016
analyzing CPU 0:
1300000
analyzing CPU 1:
1300000
analyzing CPU 2:
1300000
analyzing CPU 3:
1300000
clock() TIME = 8.43
clock_gettime() TIME = 8.43

 Performance counter stats for './cudd-example-8queens':

       8441,824238 task-clock (msec)         #    0,999 CPUs utilized          
               145 context-switches          #    0,017 K/sec                  
                 3 cpu-migrations            #    0,000 K/sec                  
             30863 page-faults               #    0,004 M/sec                  
       13958245339 cycles                    #    1,653 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
        7360082448 instructions              #    0,53  insns per cycle        
        1227803521 branches                  #  145,443 M/sec                  
          64517871 branch-misses             #    5,25% of all branches        

       8,446645648 seconds time elapsed

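EDIT2: My machine is an Intel NUC with an Intel Core i5-4250U CPU. The suggestion to use "cpupower frequency-set" was promising, but unfortunately it did not help at all. Also, I get exactly the same results with "clock()" and with "clock_gettime(CLOCK_PROCESS_CPUTIME_ID)", and these results are also confirmed by the "task-clock (msec)" line.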

Answer 1

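Modern CPUs change their frequency dynamically, so you should always measure not only wall time (astronomical time) but also the number of CPU cycles. perf stat (actually, perf stat -e task-clock,cycles,instructions is enough) shows the CPU core frequency during your program's run in the cycles line, provided a cpu-clock or task-clock event is also counted to supply the time (cycles are divided by that time to get GHz):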

 #### cycles                    #    1,653 GHz    

 #### cycles                    #    2,579 GHz

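This is Intel Turbo Boost (2), https://en.wikipedia.org/wiki/Intel_Turbo_Boost (AMD has https://en.wikipedia.org/wiki/AMD_Turbo_Core). Both react very quickly, so while cpupower -c all frequency-info is running the actual frequency is low (1.3 GHz), but once your program puts a heavy load on the CPU, the CPU raises its frequency to a higher level within a few microseconds.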

Sometimes it can be turned off in the BIOS to get more uniform measurements: http://www.intel.com/content/www/us/en/support/processors/000005641.html

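How do I enable or disable Intel® Turbo Boost Technology? - Intel® Turbo Boost Technology is typically enabled by default. You can only disable and enable the technology through a switch in the BIOS. No other user-controllable settings are available.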

Or you can try some magic MSR writes (do not write random values to random MSR registers; that can break something or hang the PC): https://askubuntu.com/questions/619875/disabling-intel-turbo-boost-in-ubuntu, in the answer by Maythux: "wrmsr -pC 0x1a0 0x4000850089"

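The other lines from perf stat (7361 vs. 7360 million instructions, 1228 vs. 1227 million branches, and 64 million branch mispredictions in both runs) say that the program is the same and executed the same code (no external randomness). You can also try perf stat -d (or better, pick some working hardware events from stat -d and list them manually in perf stat -e cpu-clock,....) to check for differences in cache events between runs.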

Answer 2

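The operating system runs multiple tasks at the same time and manages them by performing context switches from one task to another. So what changes from run to run is not the time taken by your code, but the other programs that the operating system schedules while your program is executing.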

Why is the same C program sometimes much faster
