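How do I calculate FLOPS?

Time: 2014-03-16 21:02:54

Tags: linux performance matlab cpu benchmarking

I am running my program in parallel on a server (Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz). The server has 4 cores, and each core additionally supports Hyper-Threading, i.e. 8 hardware threads in total.

I found that when the degree of parallelism of my program is up to 4, it achieves almost linear speedup (see the figure on the right). Beyond 4, however, the speedup deteriorates. I suspect the cause is the floating-point units: this server has only 4 of them. I would like to explain my experimental results by computing FLOPS (floating-point operations per second). How can I count FLOPS? Is there any other way to explain this result? Thanks.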
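
For reference, FLOPS is simply the number of floating-point operations performed divided by the elapsed time. The following is a minimal Python sketch of that arithmetic, not a measurement tool. The function names (achieved_flops, peak_flops, matmul_operations) are hypothetical, the operation count has to come from your own analysis of the algorithm (a dense n-by-n matrix product, for instance, performs about 2n^3 operations), and the figure of 16 floating-point operations per cycle per core is an assumption for illustration that should be checked against the CPU documentation.

```python
def achieved_flops(fp_operations: float, elapsed_seconds: float) -> float:
    """Floating-point operations per second actually achieved by a run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return fp_operations / elapsed_seconds


def peak_flops(cores: int, clock_hz: float, flops_per_cycle_per_core: float) -> float:
    """Theoretical upper bound: cores x clock rate x FP operations per cycle per core."""
    return cores * clock_hz * flops_per_cycle_per_core


def matmul_operations(n: int) -> int:
    """Approximate operation count of a dense n x n matrix product
    (about n^3 multiplications plus about n^3 additions)."""
    return 2 * n ** 3


if __name__ == "__main__":
    # Hypothetical workload: a 2000 x 2000 matrix product that took 1.6 s.
    ops = matmul_operations(2000)
    print(f"achieved: {achieved_flops(ops, 1.6) / 1e9:.1f} GFLOPS")

    # 4 cores at 3.4 GHz as in the question; 16 FLOPs/cycle/core is an ASSUMPTION.
    print(f"peak:     {peak_flops(4, 3.4e9, 16) / 1e9:.1f} GFLOPS")
```

Comparing the achieved figure with the peak figure gives a rough idea of how close a program is to being limited by floating-point throughput. The comparison is only as good as the operation count fed into it.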

Figure: Parallelization on an 8-core/thread server

Here is my perf list output:

afancy@ubuntu:$ perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  node-loads                                         [Hardware cache event]
  node-load-misses                                   [Hardware cache event]
  node-stores                                        [Hardware cache event]
  node-store-misses                                  [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[:access]                                [Hardware breakpoint]

Below are the results of perf stat matlab -nodesktop -nojvm < main.m:
======================Num. of cores/threads = 2======================


                 458223.935241 task-clock                #    0.999 CPUs utilized          
                        39,038 context-switches          #    0.085 K/sec                  
                            78 cpu-migrations            #    0.000 K/sec                  
                       459,290 page-faults               #    0.001 M/sec                  
             1,598,967,197,448 cycles                    #    3.489 GHz                    
               <not supported> stalled-cycles-frontend 
               <not supported> stalled-cycles-backend  
             3,052,651,880,341 instructions              #    1.91  insns per cycle        
               675,069,830,714 branches                  # 1473.231 M/sec                  
                 3,699,587,126 branch-misses             #    0.55% of all branches        

                 458.519712953 seconds time elapsed
------------------------------------------------------
     472493.757765 task-clock                #    0.999 CPUs utilized          
            40,231 context-switches          #    0.085 K/sec                  
                83 cpu-migrations            #    0.000 K/sec                  
           454,849 page-faults               #    0.963 K/sec                  
 1,648,754,575,728 cycles                    #    3.489 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 3,050,973,794,286 instructions              #    1.85  insns per cycle        
   674,701,101,539 branches                  # 1427.958 M/sec                  
     3,854,961,561 branch-misses             #    0.57% of all branches        

     472.810679033 seconds time elapsed

==============    Num. of cores/threads = 4 ==========================


         233673.870204 task-clock                #    0.998 CPUs utilized          
                20,265 context-switches          #    0.087 K/sec                  
                   110 cpu-migrations            #    0.000 K/sec                  
               248,922 page-faults               #    0.001 M/sec                  
       815,466,229,226 cycles                    #    3.490 GHz                    
       <not supported> stalled-cycles-frontend 
       <not supported> stalled-cycles-backend  
     1,528,487,784,122 instructions              #    1.87  insns per cycle        
       338,001,335,905 branches                  # 1446.466 M/sec                  
         1,878,625,642 branch-misses             #    0.56% of all branches        

         234.029335936 seconds time elapsed
---------------------------------------------
     231203.147937 task-clock                #    0.998 CPUs utilized          
            20,028 context-switches          #    0.087 K/sec                  
                91 cpu-migrations            #    0.000 K/sec                  
           249,906 page-faults               #    0.001 M/sec                  
   806,862,892,981 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,525,844,491,295 instructions              #    1.89  insns per cycle        
   337,423,026,113 branches                  # 1459.422 M/sec                  
     1,839,223,079 branch-misses             #    0.55% of all branches        

     231.578239447 seconds time elapsed
 -----------------------------------------
      233813.938379 task-clock                #    0.998 CPUs utilized          
            20,210 context-switches          #    0.086 K/sec                  
                78 cpu-migrations            #    0.000 K/sec                  
           246,951 page-faults               #    0.001 M/sec                  
   815,974,334,825 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,525,890,625,730 instructions              #    1.87  insns per cycle        
   337,426,244,903 branches                  # 1443.140 M/sec                  
     1,981,754,037 branch-misses             #    0.59% of all branches        

     234.193620912 seconds time elapsed
-------------------------------------------------
     233269.315745 task-clock                #    0.998 CPUs utilized          
            20,202 context-switches          #    0.087 K/sec                  
               112 cpu-migrations            #    0.000 K/sec                  
           230,240 page-faults               #    0.987 K/sec                  
   814,074,094,896 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,526,825,737,326 instructions              #    1.88  insns per cycle        
   337,639,762,266 branches                  # 1447.425 M/sec                  
     1,852,788,062 branch-misses             #    0.55% of all branches        

     233.642106982 seconds time elapsed     

====================== Num. of cores/threads = 6 ================


         232682.918326 task-clock                #    0.998 CPUs utilized          
            22,109 context-switches          #    0.095 K/sec                  
                96 cpu-migrations            #    0.000 K/sec                  
           172,440 page-faults               #    0.741 K/sec                  
   811,991,238,956 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,019,407,910,404 instructions              #    1.26  insns per cycle        
   225,426,394,521 branches                  #  968.814 M/sec                  
     1,344,046,527 branch-misses             #    0.60% of all branches        

     233.124504147 seconds time elapsed
 ------------------------------------------    
       210835.066220 task-clock                #    0.998 CPUs utilized          
            18,696 context-switches          #    0.089 K/sec                  
               107 cpu-migrations            #    0.001 K/sec                  
           173,955 page-faults               #    0.825 K/sec                  
   735,764,609,235 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,019,083,429,216 instructions              #    1.39  insns per cycle        
   225,355,627,333 branches                  # 1068.872 M/sec                  
     1,316,268,293 branch-misses             #    0.58% of all branches        

     211.323109113 seconds time elapsed
 ---------------------------------------------    
       179852.029353 task-clock                #    0.998 CPUs utilized          
            15,465 context-switches          #    0.086 K/sec                  
               107 cpu-migrations            #    0.001 K/sec                  
           172,942 page-faults               #    0.962 K/sec                  
   627,644,775,747 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,017,482,864,797 instructions              #    1.62  insns per cycle        
   225,004,972,767 branches                  # 1251.056 M/sec                  
     1,255,067,791 branch-misses             #    0.56% of all branches        

     180.246118105 seconds time elapsed
---------------------------------------------     
     219614.665400 task-clock                #    0.998 CPUs utilized          
            21,290 context-switches          #    0.097 K/sec                  
                90 cpu-migrations            #    0.000 K/sec                  
           170,882 page-faults               #    0.778 K/sec                  
   766,392,860,245 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,017,686,212,128 instructions              #    1.33  insns per cycle        
   225,049,868,367 branches                  # 1024.749 M/sec                  
     1,322,942,620 branch-misses             #    0.59% of all branches        

     220.092311263 seconds time elapsed
----------------------------------------------          
       176764.084715 task-clock                #    0.998 CPUs utilized          
            15,282 context-switches          #    0.086 K/sec                  
                99 cpu-migrations            #    0.001 K/sec                  
           168,629 page-faults               #    0.954 K/sec                  
   616,874,157,735 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
 1,018,436,813,450 instructions              #    1.65  insns per cycle        
   225,214,699,712 branches                  # 1274.098 M/sec                  
     1,271,583,320 branch-misses             #    0.56% of all branches        

     177.198129682 seconds time elapsed   


========================   Num. of cores/threads = 8 ==================


         207252.104133 task-clock                #    0.998 CPUs utilized          
            18,598 context-switches          #    0.090 K/sec                  
                99 cpu-migrations            #    0.000 K/sec                  
           144,037 page-faults               #    0.695 K/sec                  
   723,242,099,542 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   764,758,792,593 instructions              #    1.06  insns per cycle        
   169,108,788,865 branches                  #  815.957 M/sec                  
     1,068,941,156 branch-misses             #    0.63% of all branches        

     207.729752155 seconds time elapsed
 ----------------------------------------------  
      206174.337637 task-clock                #    0.998 CPUs utilized          
            22,188 context-switches          #    0.108 K/sec                  
               118 cpu-migrations            #    0.001 K/sec                  
           132,956 page-faults               #    0.645 K/sec                  
   719,474,677,828 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   765,214,496,607 instructions              #    1.06  insns per cycle        
   169,211,117,316 branches                  #  820.719 M/sec                  
     1,039,836,842 branch-misses             #    0.61% of all branches        

     206.652707435 seconds time elapsed
 ----------------------------------------------  
      205240.082258 task-clock                #    0.989 CPUs utilized          
            44,991 context-switches          #    0.219 K/sec                  
               163 cpu-migrations            #    0.001 K/sec                  
           136,109 page-faults               #    0.663 K/sec                  
   716,133,704,444 cycles                    #    3.489 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   763,898,836,941 instructions              #    1.07  insns per cycle        
   168,924,070,103 branches                  #  823.056 M/sec                  
     1,066,021,420 branch-misses             #    0.63% of all branches        

     207.511466061 seconds time elapsed
 ----------------------------------------------  
      205016.856849 task-clock                #    0.989 CPUs utilized          
            44,386 context-switches          #    0.216 K/sec                  
               180 cpu-migrations            #    0.001 K/sec                  
           133,995 page-faults               #    0.654 K/sec                  
   715,351,228,880 cycles                    #    3.489 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   763,637,525,789 instructions              #    1.07  insns per cycle        
   168,860,189,098 branches                  #  823.641 M/sec                  
     1,056,980,771 branch-misses             #    0.63% of all branches        

     207.231704712 seconds time elapsed
 ----------------------------------------------  
      205388.150659 task-clock                #    0.998 CPUs utilized          
            21,328 context-switches          #    0.104 K/sec                  
               103 cpu-migrations            #    0.001 K/sec                  
           135,843 page-faults               #    0.661 K/sec                  
   716,737,227,792 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   764,359,316,365 instructions              #    1.07  insns per cycle        
   169,023,595,573 branches                  #  822.947 M/sec                  
     1,045,914,789 branch-misses             #    0.62% of all branches        

     205.857635295 seconds time elapsed
 ----------------------------------------------  
      207178.729781 task-clock                #    0.998 CPUs utilized          
            17,956 context-switches          #    0.087 K/sec                  
               105 cpu-migrations            #    0.001 K/sec                  
           137,996 page-faults               #    0.666 K/sec                  
   722,998,617,131 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   763,085,695,510 instructions              #    1.06  insns per cycle        
   168,733,709,256 branches                  #  814.435 M/sec                  
     1,052,517,264 branch-misses             #    0.62% of all branches        

     207.608998891 seconds time elapsed
 ----------------------------------------------  
      206701.393252 task-clock                #    0.998 CPUs utilized          
            24,596 context-switches          #    0.119 K/sec                  
               137 cpu-migrations            #    0.001 K/sec                  
           136,553 page-faults               #    0.661 K/sec                  
   721,294,495,478 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   764,246,861,748 instructions              #    1.06  insns per cycle        
   168,997,611,020 branches                  #  817.593 M/sec                  
     1,050,078,827 branch-misses             #    0.62% of all branches        

     207.206805179 seconds time elapsed
 ----------------------------------------------  
          206455.394644 task-clock                #    0.997 CPUs utilized          
            26,089 context-switches          #    0.126 K/sec                  
                87 cpu-migrations            #    0.000 K/sec                  
           132,658 page-faults               #    0.643 K/sec                  
   720,429,194,133 cycles                    #    3.490 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   764,339,875,802 instructions              #    1.06  insns per cycle        
   169,014,685,081 branches                  #  818.650 M/sec                  
     1,047,046,966 branch-misses             #    0.62% of all branches        

     206.982094466 seconds time elapsed
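
One way to read the numbers above without a FLOPS counter is to compare speedup and instructions per cycle (IPC) across thread counts. The Python sketch below does this with the elapsed time, cycles, and instructions copied from each perf stat block. It rests on an assumption the question does not state: that each block is one worker process of the same parallel job, so the job ends when its slowest worker ends. The names RUNS, summarize, and speedups are hypothetical.

```python
# One tuple per perf stat block above: (elapsed_seconds, cycles, instructions).
# ASSUMPTION: each block is one worker process of the same parallel job.
RUNS = {
    2: [(458.519712953, 1_598_967_197_448, 3_052_651_880_341),
        (472.810679033, 1_648_754_575_728, 3_050_973_794_286)],
    4: [(234.029335936, 815_466_229_226, 1_528_487_784_122),
        (231.578239447, 806_862_892_981, 1_525_844_491_295),
        (234.193620912, 815_974_334_825, 1_525_890_625_730),
        (233.642106982, 814_074_094_896, 1_526_825_737_326)],
    6: [(233.124504147, 811_991_238_956, 1_019_407_910_404),
        (211.323109113, 735_764_609_235, 1_019_083_429_216),
        (180.246118105, 627_644_775_747, 1_017_482_864_797),
        (220.092311263, 766_392_860_245, 1_017_686_212_128),
        (177.198129682, 616_874_157_735, 1_018_436_813_450)],
    8: [(207.729752155, 723_242_099_542, 764_758_792_593),
        (206.652707435, 719_474_677_828, 765_214_496_607),
        (207.511466061, 716_133_704_444, 763_898_836_941),
        (207.231704712, 715_351_228_880, 763_637_525_789),
        (205.857635295, 716_737_227_792, 764_359_316_365),
        (207.608998891, 722_998_617_131, 763_085_695_510),
        (207.206805179, 721_294_495_478, 764_246_861_748),
        (206.982094466, 720_429_194_133, 764_339_875_802)],
}


def summarize(runs):
    """Return {threads: (wall_seconds, ipc_per_worker, est_total_instructions)}."""
    out = {}
    for threads, blocks in sorted(runs.items()):
        # The job finishes when its slowest worker finishes.
        wall = max(elapsed for elapsed, _, _ in blocks)
        cycles = sum(c for _, c, _ in blocks)
        instructions = sum(i for _, _, i in blocks)
        ipc = instructions / cycles
        # Only 5 blocks are listed for 6 threads, so scale the per-worker mean.
        est_total = instructions / len(blocks) * threads
        out[threads] = (wall, ipc, est_total)
    return out


def speedups(summary):
    """Speedup of each configuration relative to the smallest thread count."""
    base_wall = summary[min(summary)][0]
    return {t: base_wall / wall for t, (wall, _, _) in summary.items()}


if __name__ == "__main__":
    summary = summarize(RUNS)
    rel = speedups(summary)
    base = min(summary)
    for t, (wall, ipc, total) in summary.items():
        print(f"{t} threads: wall {wall:7.1f} s, speedup x{rel[t]:.2f} "
              f"(ideal x{t / base:.2f}), IPC/worker {ipc:.2f}, "
              f"est. total instructions {total:.3e}")
```

Under that assumption, the estimated total instruction count stays near 6.1e12 at every thread count, so the work appears to be split evenly, while IPC per worker falls from roughly 1.9 at 2 and 4 threads to roughly 1.06 at 8 threads. This pattern is consistent with pairs of hardware threads sharing one physical core's execution resources. It does not by itself show that the floating-point units in particular are the bottleneck.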

0 answers:

No answers