Swapping when there is enough free RAM; performance is affected

Time: 2018-05-08 17:12:46

Tags: c++ linux performance memory

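I wrote a simple program to study performance when using a large amount of RAM on Linux (64-bit Red Hat Enterprise Linux Server release 6.4). (Please ignore the memory leak.)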

#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
  struct timeval time;
  if (gettimeofday(&time, NULL))
  {
    return 0;
  }
  return (double)time.tv_sec + (double)time.tv_usec * .000001;
}


int main()
{
  int *a;
  int n = 1000000000;
  do
  {
    time_t mytime = time(NULL);
    char * time_str = ctime(&mytime);
    time_str[strlen(time_str)-1] = '\0';
    printf("Current Time : %s\n", time_str);
    double start = getWallTime();
    a = new int[n];
    for (int i = 0; i < n; i++)
    {
      a[i] = 1;
    }
    double elapsed = getWallTime()-start;
    cout << elapsed << endl;
    cout << "Allocated." << endl;
  }
  while (1);

  return 0;
}

Output

Current Time : Tue May  8 11:46:55 2018
3.73667
Allocated.
Current Time : Tue May  8 11:46:59 2018
64.5222
Allocated.
Current Time : Tue May  8 11:48:03 2018
110.419

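The top output is below. Even though enough RAM is available (much of it cached), we can see swap usage increase. As a result, the run time jumps from 3 seconds to 64 seconds.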

top - 11:46:55 up 21 days,  1:14, 18 users,  load average: 1.24, 1.25, 0.95
Tasks: 819 total,   3 running, 816 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.6%us,  1.4%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 127500344k used,  4609744k free,   262288k buffers
Swap: 10485752k total,     4112k used, 10481640k free, 45988192k cached

top - 11:47:01 up 21 days,  1:14, 18 users,  load average: 1.38, 1.27, 0.96
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  2.1%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131620156k used,   489932k free,   262288k buffers
Swap: 10485752k total,     4112k used, 10481640k free, 45844228k cached

top - 11:47:53 up 21 days,  1:15, 18 users,  load average: 1.25, 1.26, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131626300k used,   483788k free,   262276k buffers
Swap: 10485752k total,     5464k used, 10480288k free, 43056696k cached

top - 11:47:56 up 21 days,  1:15, 18 users,  load average: 1.23, 1.26, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131627568k used,   482520k free,   262276k buffers
Swap: 10485752k total,     5792k used, 10479960k free, 42949788k cached

top - 11:47:59 up 21 days,  1:15, 18 users,  load average: 1.21, 1.25, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131623080k used,   487008k free,   262276k buffers
Swap: 10485752k total,     6312k used, 10479440k free, 42840068k cached

top - 11:48:02 up 21 days,  1:15, 18 users,  load average: 1.21, 1.25, 0.97
Tasks: 819 total,   2 running, 817 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  2.5%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132110088k total, 131620016k used,   490072k free,   262276k buffers
Swap: 10485752k total,     6772k used, 10478980k free, 42729276k cached

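I have read this and this. My questions are:

  1. Why does Linux sacrifice performance instead of making full use of the cached RAM? Memory fragmentation? But putting data on swap surely causes fragmentation too.

  2. Is there a workaround to get a consistent 3 seconds until the physical RAM size is reached?

Thanks.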

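    Update 1: Added more output from top.

    Update 2: Following David's suggestion, I looked at /proc/<pid>/io, which shows that my program does no I/O. So David's first answer should explain this observation. Now for my second question: how can I improve performance as a non-root user (who cannot modify swappiness etc.)?

    Update 3: I switched to another machine because I needed sudo for some commands. This is a real machine (not a VM) with an Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz. The machine has 16 physical cores.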

    uname -a
    2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    

    Running osgx's modified code with more iterations:

    Iteration 451
    Time to malloc: 1.81198e-05
    Time to fill with data: 0.109081
    Fill rate with data: 916.75 Mints/sec, 3667Mbytes/sec
    Time to second write access of data: 0.049731
    Access rate of data: 2010.82 Mints/sec, 8043.27Mbytes/sec
    Time to third write access of data: 0.0478709
    Access rate of data: 2088.95 Mints/sec, 8355.81Mbytes/sec
    Allocated 400 Mbytes, with total memory allocated 180800Mbytes
    Iteration 452
    Time to malloc: 1.09673e-05
    Time to fill with data: 5.16316
    Fill rate with data: 19.368 Mints/sec, 77.4719Mbytes/sec
    Time to second write access of data: 0.0495219
    Access rate of data: 2019.31 Mints/sec, 8077.23Mbytes/sec
    Time to third write access of data: 0.0439548
    Access rate of data: 2275.06 Mints/sec, 9100.25Mbytes/sec
    Allocated 400 Mbytes, with total memory allocated 181200Mbytes
    

    When the slowdown happens, I do see the kernel switch from 2MB pages to 4KB pages.

    vmstat 1 60
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     2  0 1217396 11506356 5911040 47499184    0    2    35    47    0    0 14  2 84  0  0  
     2  0 1217396 11305860 5911040 47499184    4    0     4    36 5163 3460  7  6 87  0  0  
     2  0 1217396 11112744 5911040 47499188    0    0     0     0 4326 3451  7  6 87  0  0  
     2  0 1217396 10980556 5911040 47499188    0    0     0     0 4801 3385  7  6 87  0  0  
     2  0 1217396 10845940 5911040 47499192    0    0     0    20 4650 3596  7  6 87  0  0  
     2  0 1217396 10712508 5911040 47499200    0    0     0     0 5743 3562  7  6 87  0  0  
     2  0 1217396 10583380 5911040 47499200    0    0     0    40 4531 3622  7  6 87  0  0  
     2  0 1217396 10449096 5911040 47499200    0    0     0     0 4516 3629  7  6 87  0  0  
     2  0 1217396 10187856 5911040 47499200    0    0     0     0 4499 3456  7  6 87  0  0  
     2  0 1217396 10053256 5911040 47499204    0    0     0     8 5334 3507  7  6 87  0  0  
     2  0 1217396 9921624 5911040 47499204    0    0     0     0 6310 3593  6  6 87  0  0   
     2  0 1217396 9788532 5911040 47499208    0    0     0    44 5794 3516  7  6 87  0  0   
     2  0 1217396 9660516 5911040 47499208    0    0     0     0 4894 3535  7  6 87  0  0   
     2  0 1217396 9527552 5911040 47499212    0    0     0     0 4686 3570  7  6 87  0  0   
     2  0 1217396 9396536 5911040 47499212    0    0     0     0 4805 3538  7  6 87  0  0   
     2  0 1217396 9238664 5911040 47499212    0    0     0     0 5940 3459  7  6 87  0  0   
     2  0 1217396 9000136 5911040 47499216    0    0     0    32 5239 3333  7  6 87  0  0   
     2  0 1217396 8861132 5911040 47499220    0    0     0     0 5579 3351  7  6 87  0  0   
     2  0 1217396 8733688 5911040 47499220    0    0     0     0 4910 3199  7  6 87  0  0   
     2  0 1217396 8596600 5911040 47499224    0    0     0    44 5075 3453  7  6 87  0  0   
     2  0 1217396 8338468 5911040 47499232    0    0     0     0 5328 3444  7  6 87  0  0   
     2  0 1217396 8207732 5911040 47499232    0    0     0    52 5474 3370  7  6 87  0  0   
     2  0 1217396 8071212 5911040 47499236    0    0     0     0 5442 3419  7  6 87  0  0   
     2  0 1217396 7807736 5911040 47499236    0    0     0     0 6139 3456  7  6 87  0  0   
     2  0 1217396 7676080 5911044 47499232    0    0     0    16 4533 3430  6  6 87  0  0   
     2  0 1217396 7545728 5911044 47499236    0    0     0     0 6712 3957  7  6 87  0  0   
     4  0 1217396 7412444 5911044 47499240    0    0     0    68 6110 3547  7  6 87  0  0   
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     2  0 1217396 7280148 5911048 47499244    0    0     0    68 6140 3516  7  7 86  0  0   
     2  0 1217396 7147836 5911048 47499244    0    0     0     0 4434 3400  7  6 87  0  0   
     2  0 1217396 6886980 5911048 47499248    0    0     0    16 7354 3393  7  6 87  0  0   
     2  0 1217396 6752868 5911048 47499248    0    0     0     0 5286 3573  7  6 87  0  0   
     2  0 1217396 6621772 5911048 47499248    0    0     0     0 5353 3410  7  6 87  0  0   
     2  0 1217396 6489760 5911048 47499252    0    0     0    48 5172 3454  7  6 87  0  0   
     2  0 1217396 6248732 5911048 47499256    0    0     0     0 5266 3411  7  6 87  0  0   
     2  0 1217396 6092804 5911048 47499260    0    0     0     4 6345 3473  7  6 87  0  0   
     2  0 1217396 5962544 5911048 47499260    0    0     0     0 7399 3712  7  6 87  0  0   
     2  0 1217396 5828492 5911048 47499264    0    0     0     0 5804 3516  7  6 87  0  0   
     2  0 1217396 5566720 5911048 47499264    0    0     0    44 5800 3370  7  6 87  0  0   
     2  0 1217396 5434204 5911048 47499264    0    0     0     0 6716 3446  7  6 87  0  0   
     2  0 1217396 5240724 5911048 47499268    0    0     0    68 3948 3346  7  6 87  0  0   
     2  0 1217396 5051688 5911008 47484936    0    0     0     0 4743 3734  7  6 87  0  0   
     2  0 1217396 4925680 5910500 47478444    0    0   136     0 5978 3779  7  6 87  0  0   
     2  0 1217396 4801744 5908552 47471820    0    0     0    32 4573 3237  7  6 87  0  0   
     2  0 1217396 4675772 5908552 47463984    0    0     0     0 6594 3276  7  6 87  0  0   
     2  0 1217396 4486472 5908444 47455736    0    0     0     4 6096 3256  7  6 87  0  0   
     2  0 1217396 4299908 5908392 47446964    0    0     0     0 5569 3525  7  6 87  0  0   
     2  0 1217396 4175444 5906884 47440024    0    0     0     0 4975 3141  7  6 87  0  0   
     2  0 1217396 4063472 5905976 47423860    0    0     0    56 6255 3147  6  6 87  0  0   
     2  0 1217396 3939816 5905796 47415596    0    0     0     0 5396 3143  7  6 87  0  0   
     2  0 1217396 3686540 5905796 47407152    0    0     0    44 6471 3201  7  6 87  0  0   
     2  0 1217396 3557596 5905796 47398892    0    0     0     0 7581 3727  7  6 87  0  0   
     2  0 1217396 3445536 5905796 47381812    0    0     0     0 5560 3222  7  6 87  0  0   
     2  0 1217396 3250272 5905796 47373364    0    0     0    60 5594 3343  7  6 87  0  0   
     2  0 1217396 3065232 5903744 47367156    0    0     0     0 5595 3182  7  6 87  0  0   
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     3  0 1217396 2951704 5903028 47350792    0    0     0    12 5210 3262  7  6 87  0  0   
     2  0 1217396 2829228 5902928 47342444    0    0     0     0 5724 3758  7  6 87  0  0   
     2  0 1217396 2575248 5902580 47334472    0    0     0     0 4377 3369  7  6 87  0  0   
     2  0 1217396 2527996 5897796 47322436    0    0     0    60 5550 3570  7  6 87  0  0   
     2  0 1217396 2398672 5893572 47322324    0    0     0     0 5603 3225  7  6 87  0  0   
     2  0 1217396 2272536 5889364 47322228    0    0     0    16 6924 3310  7  6 87  0  0   
    
    iostat -xyz 1 60
    Linux 2.6.32-642.4.2.el6.x86_64     05/09/2018  _x86_64_    (16 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               6.64    0.00    6.26    0.00    0.00   87.10
    
    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               7.00    0.06    5.69    0.00    0.00   87.24
    

    I managed to run "sudo perf top" and saw this when the slowdown happened.

    16.84%  [kernel]                                      [k] compaction_alloc
    

    From top. Several other processes are also running (not shown).

    Tasks: 799 total,   5 running, 787 sleeping,   4 stopped,   3 zombie
    Cpu(s): 23.1%us, 16.7%sy,  0.0%ni, 60.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
    Mem:  264503640k total, 256749480k used,  7754160k free,  5830508k buffers
    Swap: 409259004k total,  1217112k used, 408041892k free, 50458600k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                   
    23559 toddwz   20   0  165g 164g 1204 R 93.0 65.4   2:05.51 a.out                                                     
    

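    Update 4: With THP turned off, I see the following. The fill rate is about 550 Mints/sec (versus 900 with THP on) until my program has used 240GB of RAM (cached RAM < 1GB). Then swapping kicks in, so the fill rate drops.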

    Iteration 610
    Time to malloc: 1.3113e-05
    Time to fill with data: 0.181151
    Fill rate with data: 552.025 Mints/sec, 2208.1Mbytes/sec
    Time to second write access of data: 0.04074
    Access rate of data: 2454.59 Mints/sec, 9818.36Mbytes/sec
    Time to third write access of data: 0.0420492
    Access rate of data: 2378.17 Mints/sec, 9512.67Mbytes/sec
    Allocated 400 Mbytes, with total memory allocated 244400Mbytes
    Iteration 611
    Time to malloc: 1.88351e-05
    Time to fill with data: 0.306215
    Fill rate with data: 326.568 Mints/sec, 1306.27Mbytes/sec
    Time to second write access of data: 0.045784
    Access rate of data: 2184.17 Mints/sec, 8736.68Mbytes/sec
    Time to third write access of data: 0.0441492
    Access rate of data: 2265.05 Mints/sec, 9060.19Mbytes/sec
    Allocated 400 Mbytes, with total memory allocated 244800Mbytes
    Iteration 612
    Time to malloc: 2.21729e-05
    Time to fill with data: 1.33305
    Fill rate with data: 75.016 Mints/sec, 300.064Mbytes/sec
    Time to second write access of data: 0.048573
    Access rate of data: 2058.76 Mints/sec, 8235.02Mbytes/sec
    Time to third write access of data: 0.0495481
    Access rate of data: 2018.24 Mints/sec, 8072.96Mbytes/sec
    Allocated 400 Mbytes, with total memory allocated 245200Mbytes
    

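    Conclusion: With transparent huge pages (THP) turned off, my program's behavior is more transparent to me, so I will keep THP off. For my particular program, the cause was THP, not swap. Thanks to everyone who helped.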

1 answer:

Answer 0 (score: 2)

Because of THP, the first iterations of your test probably use huge pages (2 MB pages): Transparent Hugepage Support - https://www.kernel.org/doc/Documentation/vm/transhuge.txt - check /sys/kernel/mm/transparent_hugepage/enabled and run grep AnonHugePages /proc/meminfo while the test is executing.

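
"The reason applications are running faster is because of two factors. The first factor is almost completely irrelevant and it's not of significant interest because it'll also have the downside of requiring larger clear-page copy-page in page faults which is a potentially negative effect. The first factor consists in taking a single page fault for each 2M virtual region touched by userland (so reducing the enter/exit kernel frequency by a 512 times factor). This only matters the first time the memory is accessed for the lifetime of a memory mapping."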

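Large allocations made with new or malloc are served by a single mmap system call, which usually does not "populate" the virtual memory with physical pages; see man mmap around MAP_POPULATE: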
   MAP_POPULATE (since Linux 2.5.46)
          Populate (prefault) page tables for a mapping. ... This will help
          to reduce blocking on page faults later.

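This memory is only registered by mmap (without MAP_POPULATE) as virtual memory, with no writable mapping present in the page table. When your test first writes to any page of it, a page fault exception is generated and handled by the OS kernel. The Linux kernel then allocates some physical memory and maps the virtual page to it (it populates the page). With THP enabled (it usually is), the kernel can allocate a single huge page of 2MB if it has free huge physical pages. If there are no free huge pages, the kernel allocates a 4KB page. So without huge pages you get 512 times more page faults (this can be checked by running vmstat 1 180 or perf stat -I 1000 in another console while the test is running).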

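Subsequent accesses to populated pages do not cause page faults, so you can extend your test with a second (and third) for i in (0..N-1): a[i] = 1; loop and measure the time of each loop.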

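Your results still sound strange. Is your system real or virtualized? A hypervisor may support 2 MB pages too, and a virtualized system may have much higher costs for memory allocation and exception handling.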

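On a PC with less memory I see a slowdown of around 10% when page faults switch from huge page allocation to 4KB page allocation (check the page-faults lines in the perf stat output: there are only about 2 thousand page faults per second with 2MB pages and more than 200,000 page faults per second with 4KB pages):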

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 8.10623e-06
Time to fill with data: 0.364378
Fill rate with data: 274.44 Mints/sec, 1097.76Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.90735e-05
Time to fill with data: 0.357983
Fill rate with data: 279.343 Mints/sec, 1117.37Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 1.69277e-05
#           time             counts unit events
     1.000414902         999.893040      task-clock (msec)
     1.000414902                  1      context-switches          #    0.001 K/sec
     1.000414902                  0      cpu-migrations            #    0.000 K/sec
     1.000414902              2,024      page-faults               #    0.002 M/sec
     1.000414902      2,664,963,857      cycles                    #    2.665 GHz
     1.000414902      3,072,781,834      instructions              #    1.15  insn per cycle
     1.000414902        559,551,437      branches                  #  559.611 M/sec
     1.000414902             25,176      branch-misses             #    0.00% of all branches
Time to fill with data: 0.357014
Fill rate with data: 280.101 Mints/sec, 1120.4Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 1.71661e-05
Time to fill with data: 0.358964
Fill rate with data: 278.579 Mints/sec, 1114.32Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.69277e-05
Time to fill with data: 0.356918
Fill rate with data: 280.177 Mints/sec, 1120.71Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.50204e-05
     2.000779126        1000.703872      task-clock (msec)
     2.000779126                  1      context-switches          #    0.001 K/sec
     2.000779126                  0      cpu-migrations            #    0.000 K/sec
     2.000779126              2,280      page-faults               #    0.002 M/sec
     2.000779126      2,686,072,244      cycles                    #    2.685 GHz
     2.000779126      3,094,777,285      instructions              #    1.16  insn per cycle
     2.000779126        563,593,105      branches                  #  563.425 M/sec
     2.000779126              9,661      branch-misses             #    0.00% of all branches
Time to fill with data: 0.371785
Fill rate with data: 268.973 Mints/sec, 1075.89Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.90735e-05
Time to fill with data: 0.418562
Fill rate with data: 238.913 Mints/sec, 955.653Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 2.09808e-05
     3.001146481        1000.436128      task-clock (msec)
     3.001146481                  1      context-switches          #    0.001 K/sec
     3.001146481                  0      cpu-migrations            #    0.000 K/sec
     3.001146481            217,415      page-faults               #    0.217 M/sec
     3.001146481      2,687,783,783      cycles                    #    2.687 GHz
     3.001146481      3,100,713,038      instructions              #    1.16  insn per cycle
     3.001146481        560,207,049      branches                  #  560.014 M/sec
     3.001146481             83,230      branch-misses             #    0.01% of all branches
Time to fill with data: 0.416297
Fill rate with data: 240.213 Mints/sec, 960.853Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.38283e-05
Time to fill with data: 0.41672
Fill rate with data: 239.969 Mints/sec, 959.877Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.40667e-05
Time to fill with data: 0.424997
Fill rate with data: 235.296 Mints/sec, 941.183Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.28746e-05
     4.001467773        1000.378604      task-clock (msec)
     4.001467773                  2      context-switches          #    0.002 K/sec
     4.001467773                  0      cpu-migrations            #    0.000 K/sec
     4.001467773            232,690      page-faults               #    0.233 M/sec
     4.001467773      2,655,313,682      cycles                    #    2.654 GHz
     4.001467773      3,087,157,016      instructions              #    1.15  insn per cycle
     4.001467773        557,266,313      branches                  #  557.070 M/sec
     4.001467773             95,433      branch-misses             #    0.02% of all branches
Time to fill with data: 0.413271
Fill rate with data: 241.972 Mints/sec, 967.888Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 1.21593e-05
Time to fill with data: 0.414624
Fill rate with data: 241.182 Mints/sec, 964.73Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.5974e-05
     5.001792272        1000.372602      task-clock (msec)
     5.001792272                  2      context-switches          #    0.002 K/sec
     5.001792272                  0      cpu-migrations            #    0.000 K/sec
     5.001792272            236,260      page-faults               #    0.236 M/sec
     5.001792272      2,687,340,230      cycles                    #    2.686 GHz
     5.001792272      3,134,864,968      instructions              #    1.17  insn per cycle
     5.001792272        565,846,287      branches                  #  565.644 M/sec
     5.001792272            104,634      branch-misses             #    0.02% of all branches
Time to fill with data: 0.412331
Fill rate with data: 242.524 Mints/sec, 970.094Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.3113e-05
Time to fill with data: 0.414433
Fill rate with data: 241.294 Mints/sec, 965.174Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.88351e-05
Time to fill with data: 0.417277
Fill rate with data: 239.649 Mints/sec, 958.596Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
     6.002129544        1000.404270      task-clock (msec)
     6.002129544                  1      context-switches          #    0.001 K/sec
     6.002129544                  0      cpu-migrations            #    0.000 K/sec
     6.002129544            215,269      page-faults               #    0.215 M/sec
     6.002129544      2,676,269,667      cycles                    #    2.675 GHz
     6.002129544      3,286,469,282      instructions              #    1.23  insn per cycle
     6.002129544        578,367,266      branches                  #  578.156 M/sec
     6.002129544            345,470      branch-misses             #    0.06% of all branches
    ....

After disabling THP with the root commands from https://access.redhat.com/solutions/46111, I always get ~200,000 page faults per second and about 950 MB/s:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.422322
Fill rate with data: 236.786 Mints/sec, 947.145Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.50204e-05
Time to fill with data: 0.415068
Fill rate with data: 240.924 Mints/sec, 963.698Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.19345e-05
#           time             counts unit events
     1.000162191         999.429856      task-clock (msec)
     1.000162191                 14      context-switches          #    0.014 K/sec
     1.000162191                  0      cpu-migrations            #    0.000 K/sec
     1.000162191            232,727      page-faults               #    0.233 M/sec
     1.000162191      2,664,896,604      cycles                    #    2.666 GHz
     1.000162191      3,080,713,267      instructions              #    1.16  insn per cycle
     1.000162191        555,116,838      branches                  #  555.434 M/sec
     1.000162191            102,262      branch-misses             #    0.02% of all branches
Time to fill with data: 0.440695
Fill rate with data: 226.914 Mints/sec, 907.658Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.09808e-05
Time to fill with data: 0.414463
Fill rate with data: 241.276 Mints/sec, 965.104Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.81198e-05
     2.000544564        1000.142465      task-clock (msec)
     2.000544564                 16      context-switches          #    0.016 K/sec
     2.000544564                  0      cpu-migrations            #    0.000 K/sec
     2.000544564            229,697      page-faults               #    0.230 M/sec
     2.000544564      2,621,180,984      cycles                    #    2.622 GHz
     2.000544564      3,041,358,811      instructions              #    1.15  insn per cycle
     2.000544564        547,910,242      branches                  #  548.027 M/sec
     2.000544564             93,682      branch-misses             #    0.02% of all branches
Time to fill with data: 0.428383
Fill rate with data: 233.436 Mints/sec, 933.744Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.5974e-05
Time to fill with data: 0.421986
Fill rate with data: 236.975 Mints/sec, 947.899Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.5974e-05
Time to fill with data: 0.413477
Fill rate with data: 241.851 Mints/sec, 967.406Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 1.88351e-05
     3.000866438         999.980461      task-clock (msec)
     3.000866438                 20      context-switches          #    0.020 K/sec
     3.000866438                  0      cpu-migrations            #    0.000 K/sec
     3.000866438            231,194      page-faults               #    0.231 M/sec
     3.000866438      2,622,484,960      cycles                    #    2.623 GHz
     3.000866438      3,061,610,229      instructions              #    1.16  insn per cycle
     3.000866438        551,533,361      branches                  #  551.616 M/sec
     3.000866438            104,561      branch-misses             #    0.02% of all branches
Time to fill with data: 0.448333
Fill rate with data: 223.048 Mints/sec, 892.194Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.50204e-05
Time to fill with data: 0.410566
Fill rate with data: 243.566 Mints/sec, 974.265Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.3113e-05
     4.001231042        1000.098860      task-clock (msec)
     4.001231042                 17      context-switches          #    0.017 K/sec
     4.001231042                  0      cpu-migrations            #    0.000 K/sec
     4.001231042            228,532      page-faults               #    0.229 M/sec
     4.001231042      2,586,146,024      cycles                    #    2.586 GHz
     4.001231042      3,026,679,955      instructions              #    1.15  insn per cycle
     4.001231042        545,236,541      branches                  #  545.284 M/sec
     4.001231042            115,251      branch-misses             #    0.02% of all branches
Time to fill with data: 0.441442
Fill rate with data: 226.53 Mints/sec, 906.121Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.5974e-05
Time to fill with data: 0.42898
Fill rate with data: 233.111 Mints/sec, 932.445Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.00272e-05
     5.001547227         999.982415      task-clock (msec)
     5.001547227                 19      context-switches          #    0.019 K/sec
     5.001547227                  0      cpu-migrations            #    0.000 K/sec
     5.001547227            225,796      page-faults               #    0.226 M/sec
     5.001547227      2,560,990,918      cycles                    #    2.561 GHz
     5.001547227      3,005,384,743      instructions              #    1.15  insn per cycle
     5.001547227        542,275,580      branches                  #  542.315 M/sec
     5.001547227            116,537      branch-misses             #    0.02% of all branches
Time to fill with data: 0.414212
Fill rate with data: 241.422 Mints/sec, 965.689Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.69277e-05
Time to fill with data: 0.411084
Fill rate with data: 243.259 Mints/sec, 973.037Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.40667e-05
Time to fill with data: 0.413644
Fill rate with data: 241.754 Mints/sec, 967.015Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.28746e-05
     6.001849796         999.913923      task-clock (msec)
     6.001849796                 18      context-switches          #    0.018 K/sec
     6.001849796                  0      cpu-migrations            #    0.000 K/sec
     6.001849796            236,912      page-faults               #    0.237 M/sec
     6.001849796      2,685,445,660      cycles                    #    2.686 GHz
     6.001849796      3,153,464,551      instructions              #    1.20  insn per cycle
     6.001849796        568,989,467      branches                  #  569.032 M/sec
     6.001849796            125,943      branch-misses             #    0.02% of all branches
Time to fill with data: 0.444891
Fill rate with data: 224.774 Mints/sec, 899.097Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes

The test as modified for perf stat, with rate printing and a limited number of iterations:

$ cat test.c; g++ test.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
  struct timeval time;
  if (gettimeofday(&time, NULL))
  {
    return 0;
  }
  return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
  int *a;
  int n = 100000000;
  int j;
  double total = 0;
  for(j=0; j<15; j++)
  {
    cout << "Iteration " << j << endl;
    double start = getWallTime();
    a = new int[n];
    cout << "Time to malloc: " << getWallTime() - start << endl;
    for (int i = 0; i < n; i++)
    {
      a[i] = 1;
    }
    double elapsed = getWallTime()-start;
    cout << "Time to fill with data: " << elapsed << endl;
    cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;
    total += n*sizeof(int)*1./M;
    cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
  }

  return 0;
}

The test as modified for second and third write accesses:

$ g++ second.c -o second
$ cat second.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
  struct timeval time;
  if (gettimeofday(&time, NULL))
  {
    return 0;
  }
  return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
  int *a;
  int n = 100000000;
  int j;
  double total = 0;
  for(j=0; j<15; j++)
  {
    cout << "Iteration " << j << endl;
    double start = getWallTime();
    a = new int[n];
    cout << "Time to malloc: " << getWallTime() - start << endl;
    for (int i = 0; i < n; i++)
    {
      a[i] = 1;
    }
    double elapsed = getWallTime()-start;
    cout << "Time to fill with data: " << elapsed << endl;
    cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;


    start = getWallTime();
    for (int i = 0; i < n; i++)
    {
      a[i] = 2;
    }
    elapsed = getWallTime()-start;
    cout << "Time to second write access of data: " << elapsed << endl;
    cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;

    start = getWallTime();
    for (int i = 0; i < n; i++)
    {
      a[i] = 3;
    }
    elapsed = getWallTime()-start;
    cout << "Time to third write access of data: " << elapsed << endl;
    cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec"  << endl;


    total += n*sizeof(int)*1./M;
    cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
  }

  return 0;
}

Without THP - the second and third accesses run at about 1.25 GB/s:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ ./second
Iteration 0
Time to malloc: 9.05991e-06
Time to fill with data: 0.426387
Fill rate with data: 234.529 Mints/sec, 938.115Mbytes/sec
Time to second write access of data: 0.318292
Access rate of data: 314.177 Mints/sec, 1256.71Mbytes/sec
Time to third write access of data: 0.321722
Access rate of data: 310.827 Mints/sec, 1243.31Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 3.50475e-05
Time to fill with data: 0.411859
Fill rate with data: 242.802 Mints/sec, 971.206Mbytes/sec
Time to second write access of data: 0.317989
Access rate of data: 314.476 Mints/sec, 1257.91Mbytes/sec
Time to third write access of data: 0.321637
Access rate of data: 310.91 Mints/sec, 1243.64Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.81334e-05
Time to fill with data: 0.411918
Fill rate with data: 242.767 Mints/sec, 971.067Mbytes/sec
Time to second write access of data: 0.318647
Access rate of data: 313.827 Mints/sec, 1255.31Mbytes/sec
Time to third write access of data: 0.321041
Access rate of data: 311.487 Mints/sec, 1245.95Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.5034e-05
Time to fill with data: 0.411138
Fill rate with data: 243.227 Mints/sec, 972.909Mbytes/sec
Time to second write access of data: 0.318429
Access rate of data: 314.042 Mints/sec, 1256.17Mbytes/sec
Time to third write access of data: 0.321332
Access rate of data: 311.205 Mints/sec, 1244.82Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 3.71933e-05
Time to fill with data: 0.410922
Fill rate with data: 243.355 Mints/sec, 973.421Mbytes/sec
Time to second write access of data: 0.320262
Access rate of data: 312.244 Mints/sec, 1248.98Mbytes/sec
Time to third write access of data: 0.319223
Access rate of data: 313.261 Mints/sec, 1253.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 2.19345e-05
Time to fill with data: 0.418508
Fill rate with data: 238.944 Mints/sec, 955.777Mbytes/sec
Time to second write access of data: 0.320419
Access rate of data: 312.092 Mints/sec, 1248.37Mbytes/sec
Time to third write access of data: 0.319752
Access rate of data: 312.742 Mints/sec, 1250.97Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 3.19481e-05
Time to fill with data: 0.410054
Fill rate with data: 243.87 Mints/sec, 975.481Mbytes/sec
Time to second write access of data: 0.320244
Access rate of data: 312.262 Mints/sec, 1249.05Mbytes/sec
Time to third write access of data: 0.319546
Access rate of data: 312.944 Mints/sec, 1251.78Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 3.19481e-05
Time to fill with data: 0.409491
Fill rate with data: 244.206 Mints/sec, 976.822Mbytes/sec
Time to second write access of data: 0.318501
Access rate of data: 313.971 Mints/sec, 1255.88Mbytes/sec
Time to third write access of data: 0.320052
Access rate of data: 312.449 Mints/sec, 1249.8Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 2.5034e-05
Time to fill with data: 0.409922
Fill rate with data: 243.949 Mints/sec, 975.795Mbytes/sec
Time to second write access of data: 0.320583
Access rate of data: 311.932 Mints/sec, 1247.73Mbytes/sec
Time to third write access of data: 0.319478
Access rate of data: 313.011 Mints/sec, 1252.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 2.69413e-05
Time to fill with data: 0.41104
Fill rate with data: 243.285 Mints/sec, 973.141Mbytes/sec
Time to second write access of data: 0.320389
Access rate of data: 312.121 Mints/sec, 1248.48Mbytes/sec
Time to third write access of data: 0.319762
Access rate of data: 312.733 Mints/sec, 1250.93Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 2.59876e-05
Time to fill with data: 0.412612
Fill rate with data: 242.358 Mints/sec, 969.434Mbytes/sec
Time to second write access of data: 0.318304
Access rate of data: 314.165 Mints/sec, 1256.66Mbytes/sec
Time to third write access of data: 0.319453
Access rate of data: 313.035 Mints/sec, 1252.14Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.98023e-05
Time to fill with data: 0.412428
Fill rate with data: 242.467 Mints/sec, 969.866Mbytes/sec
Time to second write access of data: 0.318467
Access rate of data: 314.004 Mints/sec, 1256.02Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 2.69413e-05
Time to fill with data: 0.410515
Fill rate with data: 243.597 Mints/sec, 974.386Mbytes/sec
Time to second write access of data: 0.31832
Access rate of data: 314.149 Mints/sec, 1256.6Mbytes/sec
Time to third write access of data: 0.319569
Access rate of data: 312.921 Mints/sec, 1251.69Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 2.28882e-05
Time to fill with data: 0.412385
Fill rate with data: 242.492 Mints/sec, 969.967Mbytes/sec
Time to second write access of data: 0.318929
Access rate of data: 313.549 Mints/sec, 1254.2Mbytes/sec
Time to third write access of data: 0.31949
Access rate of data: 312.999 Mints/sec, 1252Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 2.90871e-05
Time to fill with data: 0.41235
Fill rate with data: 242.512 Mints/sec, 970.05Mbytes/sec
Time to second write access of data: 0.340456
Access rate of data: 293.724 Mints/sec, 1174.89Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes

With THP, allocation is faster, but the second and third accesses run at the same speed:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ ./second
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.365043
Fill rate with data: 273.94 Mints/sec, 1095.76Mbytes/sec
Time to second write access of data: 0.320503
Access rate of data: 312.01 Mints/sec, 1248.04Mbytes/sec
Time to third write access of data: 0.319442
Access rate of data: 313.046 Mints/sec, 1252.18Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
...
Iteration 14
Time to malloc: 2.7895e-05
Time to fill with data: 0.409294
Fill rate with data: 244.323 Mints/sec, 977.293Mbytes/sec
Time to second write access of data: 0.318422
Access rate of data: 314.049 Mints/sec, 1256.19Mbytes/sec
Time to third write access of data: 0.322098
Access rate of data: 310.465 Mints/sec, 1241.86Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes