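I wrote a simple program to study performance when using a large amount of RAM on Linux (64-bit Red Hat Enterprise Linux Server release 6.4). (Please ignore the memory leak.)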
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
    struct timeval time;
    if (gettimeofday(&time, NULL))
    {
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

int main()
{
    int *a;
    int n = 1000000000;
    do
    {
        time_t mytime = time(NULL);
        char * time_str = ctime(&mytime);
        time_str[strlen(time_str)-1] = '\0';
        printf("Current Time : %s\n", time_str);
        double start = getWallTime();
        a = new int[n];
        for (int i = 0; i < n; i++)
        {
            a[i] = 1;
        }
        double elapsed = getWallTime()-start;
        cout << elapsed << endl;
        cout << "Allocated." << endl;
    }
    while (1);
    return 0;
}
Output:
Current Time : Tue May 8 11:46:55 2018
3.73667
Allocated.
Current Time : Tue May 8 11:46:59 2018
64.5222
Allocated.
Current Time : Tue May 8 11:48:03 2018
110.419
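The output of top is below. Although there is plenty of RAM in use as cache, we can see swap usage increase. The result is that the run time jumps from 3 seconds to 64 seconds.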
top - 11:46:55 up 21 days, 1:14, 18 users, load average: 1.24, 1.25, 0.95
Tasks: 819 total, 3 running, 816 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.6%us, 1.4%sy, 0.0%ni, 97.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 127500344k used, 4609744k free, 262288k buffers
Swap: 10485752k total, 4112k used, 10481640k free, 45988192k cached
top - 11:47:01 up 21 days, 1:14, 18 users, load average: 1.38, 1.27, 0.96
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 2.1%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131620156k used, 489932k free, 262288k buffers
Swap: 10485752k total, 4112k used, 10481640k free, 45844228k cached
top - 11:47:53 up 21 days, 1:15, 18 users, load average: 1.25, 1.26, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131626300k used, 483788k free, 262276k buffers
Swap: 10485752k total, 5464k used, 10480288k free, 43056696k cached
top - 11:47:56 up 21 days, 1:15, 18 users, load average: 1.23, 1.26, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131627568k used, 482520k free, 262276k buffers
Swap: 10485752k total, 5792k used, 10479960k free, 42949788k cached
top - 11:47:59 up 21 days, 1:15, 18 users, load average: 1.21, 1.25, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131623080k used, 487008k free, 262276k buffers
Swap: 10485752k total, 6312k used, 10479440k free, 42840068k cached
top - 11:48:02 up 21 days, 1:15, 18 users, load average: 1.21, 1.25, 0.97
Tasks: 819 total, 2 running, 817 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 2.5%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132110088k total, 131620016k used, 490072k free, 262276k buffers
Swap: 10485752k total, 6772k used, 10478980k free, 42729276k cached
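Why does Linux sacrifice performance instead of fully using the cached RAM? Memory fragmentation? But putting data on swap would surely create fragmentation as well.
Is there a workaround to get a consistent 3 seconds until the physical RAM size is reached?
Thanks.
Update 1: Added more output from top.
Update 2: Following David's suggestion, looking at /proc/<pid>/io shows that my program does no I/O. So David's first answer should explain this observation. Now on to my second question: how can I improve performance as a non-root user (who cannot modify swappiness, etc.)?
Update 3: I switched to another machine because I needed to run some commands with sudo. This is a real machine (not a virtual machine) with an Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz. The machine has 16 physical cores.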
uname -a
2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Running osgx's modified code with more iterations:
Iteration 451
Time to malloc: 1.81198e-05
Time to fill with data: 0.109081
Fill rate with data: **916**.75 Mints/sec, 3667Mbytes/sec
Time to second write access of data: 0.049731
Access rate of data: 2010.82 Mints/sec, 8043.27Mbytes/sec
Time to third write access of data: 0.0478709
Access rate of data: 2088.95 Mints/sec, 8355.81Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 180800Mbytes
Iteration 452
Time to malloc: 1.09673e-05
Time to fill with data: 5.16316
Fill rate with data: **19**.368 Mints/sec, 77.4719Mbytes/sec
Time to second write access of data: 0.0495219
Access rate of data: 2019.31 Mints/sec, 8077.23Mbytes/sec
Time to third write access of data: 0.0439548
Access rate of data: 2275.06 Mints/sec, 9100.25Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 181200Mbytes
When the slowdown occurs, I do see the kernel switch from 2 MB pages to 4 KB pages.
vmstat 1 60
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1217396 11506356 5911040 47499184 0 2 35 47 0 0 14 2 84 0 0
2 0 1217396 11305860 5911040 47499184 4 0 4 36 5163 3460 7 6 87 0 0
2 0 1217396 11112744 5911040 47499188 0 0 0 0 4326 3451 7 6 87 0 0
2 0 1217396 10980556 5911040 47499188 0 0 0 0 4801 3385 7 6 87 0 0
2 0 1217396 10845940 5911040 47499192 0 0 0 20 4650 3596 7 6 87 0 0
2 0 1217396 10712508 5911040 47499200 0 0 0 0 5743 3562 7 6 87 0 0
2 0 1217396 10583380 5911040 47499200 0 0 0 40 4531 3622 7 6 87 0 0
2 0 1217396 10449096 5911040 47499200 0 0 0 0 4516 3629 7 6 87 0 0
2 0 1217396 10187856 5911040 47499200 0 0 0 0 4499 3456 7 6 87 0 0
2 0 1217396 10053256 5911040 47499204 0 0 0 8 5334 3507 7 6 87 0 0
2 0 1217396 9921624 5911040 47499204 0 0 0 0 6310 3593 6 6 87 0 0
2 0 1217396 9788532 5911040 47499208 0 0 0 44 5794 3516 7 6 87 0 0
2 0 1217396 9660516 5911040 47499208 0 0 0 0 4894 3535 7 6 87 0 0
2 0 1217396 9527552 5911040 47499212 0 0 0 0 4686 3570 7 6 87 0 0
2 0 1217396 9396536 5911040 47499212 0 0 0 0 4805 3538 7 6 87 0 0
2 0 1217396 9238664 5911040 47499212 0 0 0 0 5940 3459 7 6 87 0 0
2 0 1217396 9000136 5911040 47499216 0 0 0 32 5239 3333 7 6 87 0 0
2 0 1217396 8861132 5911040 47499220 0 0 0 0 5579 3351 7 6 87 0 0
2 0 1217396 8733688 5911040 47499220 0 0 0 0 4910 3199 7 6 87 0 0
2 0 1217396 8596600 5911040 47499224 0 0 0 44 5075 3453 7 6 87 0 0
2 0 1217396 8338468 5911040 47499232 0 0 0 0 5328 3444 7 6 87 0 0
2 0 1217396 8207732 5911040 47499232 0 0 0 52 5474 3370 7 6 87 0 0
2 0 1217396 8071212 5911040 47499236 0 0 0 0 5442 3419 7 6 87 0 0
2 0 1217396 7807736 5911040 47499236 0 0 0 0 6139 3456 7 6 87 0 0
2 0 1217396 7676080 5911044 47499232 0 0 0 16 4533 3430 6 6 87 0 0
2 0 1217396 7545728 5911044 47499236 0 0 0 0 6712 3957 7 6 87 0 0
4 0 1217396 7412444 5911044 47499240 0 0 0 68 6110 3547 7 6 87 0 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 1217396 7280148 5911048 47499244 0 0 0 68 6140 3516 7 7 86 0 0
2 0 1217396 7147836 5911048 47499244 0 0 0 0 4434 3400 7 6 87 0 0
2 0 1217396 6886980 5911048 47499248 0 0 0 16 7354 3393 7 6 87 0 0
2 0 1217396 6752868 5911048 47499248 0 0 0 0 5286 3573 7 6 87 0 0
2 0 1217396 6621772 5911048 47499248 0 0 0 0 5353 3410 7 6 87 0 0
2 0 1217396 6489760 5911048 47499252 0 0 0 48 5172 3454 7 6 87 0 0
2 0 1217396 6248732 5911048 47499256 0 0 0 0 5266 3411 7 6 87 0 0
2 0 1217396 6092804 5911048 47499260 0 0 0 4 6345 3473 7 6 87 0 0
2 0 1217396 5962544 5911048 47499260 0 0 0 0 7399 3712 7 6 87 0 0
2 0 1217396 5828492 5911048 47499264 0 0 0 0 5804 3516 7 6 87 0 0
2 0 1217396 5566720 5911048 47499264 0 0 0 44 5800 3370 7 6 87 0 0
2 0 1217396 5434204 5911048 47499264 0 0 0 0 6716 3446 7 6 87 0 0
2 0 1217396 5240724 5911048 47499268 0 0 0 68 3948 3346 7 6 87 0 0
2 0 1217396 5051688 5911008 47484936 0 0 0 0 4743 3734 7 6 87 0 0
2 0 1217396 4925680 5910500 47478444 0 0 136 0 5978 3779 7 6 87 0 0
2 0 1217396 4801744 5908552 47471820 0 0 0 32 4573 3237 7 6 87 0 0
2 0 1217396 4675772 5908552 47463984 0 0 0 0 6594 3276 7 6 87 0 0
2 0 1217396 4486472 5908444 47455736 0 0 0 4 6096 3256 7 6 87 0 0
2 0 1217396 4299908 5908392 47446964 0 0 0 0 5569 3525 7 6 87 0 0
2 0 1217396 4175444 5906884 47440024 0 0 0 0 4975 3141 7 6 87 0 0
2 0 1217396 4063472 5905976 47423860 0 0 0 56 6255 3147 6 6 87 0 0
2 0 1217396 3939816 5905796 47415596 0 0 0 0 5396 3143 7 6 87 0 0
2 0 1217396 3686540 5905796 47407152 0 0 0 44 6471 3201 7 6 87 0 0
2 0 1217396 3557596 5905796 47398892 0 0 0 0 7581 3727 7 6 87 0 0
2 0 1217396 3445536 5905796 47381812 0 0 0 0 5560 3222 7 6 87 0 0
2 0 1217396 3250272 5905796 47373364 0 0 0 60 5594 3343 7 6 87 0 0
2 0 1217396 3065232 5903744 47367156 0 0 0 0 5595 3182 7 6 87 0 0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 1217396 2951704 5903028 47350792 0 0 0 12 5210 3262 7 6 87 0 0
2 0 1217396 2829228 5902928 47342444 0 0 0 0 5724 3758 7 6 87 0 0
2 0 1217396 2575248 5902580 47334472 0 0 0 0 4377 3369 7 6 87 0 0
2 0 1217396 2527996 5897796 47322436 0 0 0 60 5550 3570 7 6 87 0 0
2 0 1217396 2398672 5893572 47322324 0 0 0 0 5603 3225 7 6 87 0 0
2 0 1217396 2272536 5889364 47322228 0 0 0 16 6924 3310 7 6 87 0 0
iostat -xyz 1 60
Linux 2.6.32-642.4.2.el6.x86_64 05/09/2018 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
6.64 0.00 6.26 0.00 0.00 87.10
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
avg-cpu: %user %nice %system %iowait %steal %idle
7.00 0.06 5.69 0.00 0.00 87.24
I managed to run "sudo perf top" and saw this when the slowdown occurred.
16.84% [kernel] [k] compaction_alloc
From top. Several other processes are also running (not shown).
Tasks: 799 total, 5 running, 787 sleeping, 4 stopped, 3 zombie
Cpu(s): 23.1%us, 16.7%sy, 0.0%ni, 60.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 264503640k total, 256749480k used, 7754160k free, 5830508k buffers
Swap: 409259004k total, 1217112k used, 408041892k free, 50458600k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23559 toddwz 20 0 165g 164g 1204 R 93.0 65.4 2:05.51 a.out
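Update 4: After turning THP off, I see the following. Until my program uses 240 GB of RAM (cached RAM < 1 GB), the fill rate is about 550 Mints/sec (900 with THP on). Then swapping starts, so the fill rate drops.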
Iteration 610
Time to malloc: 1.3113e-05
Time to fill with data: 0.181151
Fill rate with data: 552.025 Mints/sec, 2208.1Mbytes/sec
Time to second write access of data: 0.04074
Access rate of data: 2454.59 Mints/sec, 9818.36Mbytes/sec
Time to third write access of data: 0.0420492
Access rate of data: 2378.17 Mints/sec, 9512.67Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244400Mbytes
Iteration 611
Time to malloc: 1.88351e-05
Time to fill with data: 0.306215
Fill rate with data: 326.568 Mints/sec, 1306.27Mbytes/sec
Time to second write access of data: 0.045784
Access rate of data: 2184.17 Mints/sec, 8736.68Mbytes/sec
Time to third write access of data: 0.0441492
Access rate of data: 2265.05 Mints/sec, 9060.19Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 244800Mbytes
Iteration 612
Time to malloc: 2.21729e-05
Time to fill with data: 1.33305
Fill rate with data: 75.016 Mints/sec, 300.064Mbytes/sec
Time to second write access of data: 0.048573
Access rate of data: 2058.76 Mints/sec, 8235.02Mbytes/sec
Time to third write access of data: 0.0495481
Access rate of data: 2018.24 Mints/sec, 8072.96Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 245200Mbytes
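Conclusion: With transparent huge pages (THP) turned off, my program's behavior is more transparent to me, so I will keep THP off. For my particular program, the cause was THP, not swapping. Thanks to everyone who helped.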
Answer 0 (score: 2)
The first iterations of the test probably use huge pages (2 MB pages) because of THP: Transparent Hugepages - https://www.kernel.org/doc/Documentation/vm/transhuge.txt
Check /sys/kernel/mm/transparent_hugepage/enabled and run grep AnonHugePages /proc/meminfo while the test is executing.
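The reason applications are running faster is because of two factors. The first factor is almost completely irrelevant and it's not of significant interest because it'll also have the downside of requiring larger clear-page copy-page in page faults which is a potentially negative effect. The first factor consists in taking a single page fault for each 2M virtual region touched by userland (so reducing the enter/exit kernel frequency by a 512 times factor). This only matters the first time the memory is accessed for the lifetime of a memory mapping.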
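Allocating a large amount of memory with new or malloc is served by a single mmap system call, which usually does not "populate" the virtual memory with physical pages; see MAP_POPULATE in man mmap: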
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. ... This will help
to reduce blocking on page faults later.
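Without MAP_POPULATE, mmap only registers this memory as virtual, with write access prohibited in the page table. When your test writes to a page for the first time, a page fault exception is generated and handled by the OS kernel. The Linux kernel then allocates some physical memory and maps the virtual page to it (it populates the page). With THP enabled (it often is), the kernel can allocate a single huge page of 2 MB if it has free huge physical pages. If there are no free huge pages, the kernel allocates 4 KB pages. So without huge pages you will get 512 times more page faults (you can check this by running vmstat 1 180 in another console while your test is running, or with perf stat -I 1000).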
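Later accesses to populated pages do not cause page faults, so you can extend your test with a second (and third) for i in (0..N-1): a[i] = 1; loop and measure the time of each loop separately.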
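Your results still sound strange. Is your system real or virtualized? Hypervisors may support 2 MB pages, and virtualized systems may have much higher memory allocation and exception handling costs.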
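On my PC, which has less memory, I get about a 10% slowdown when page faults switch from huge page allocation to 4 KB page allocation (check the page-faults lines from perf stat: only about 2 thousand page faults per second with 2 MB pages, and more than 200 thousand page faults per second with 4 KB pages):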
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 8.10623e-06
Time to fill with data: 0.364378
Fill rate with data: 274.44 Mints/sec, 1097.76Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.90735e-05
Time to fill with data: 0.357983
Fill rate with data: 279.343 Mints/sec, 1117.37Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 1.69277e-05
# time counts unit events
1.000414902 999.893040 task-clock (msec)
1.000414902 1 context-switches # 0.001 K/sec
1.000414902 0 cpu-migrations # 0.000 K/sec
1.000414902 2,024 page-faults # 0.002 M/sec
1.000414902 2,664,963,857 cycles # 2.665 GHz
1.000414902 3,072,781,834 instructions # 1.15 insn per cycle
1.000414902 559,551,437 branches # 559.611 M/sec
1.000414902 25,176 branch-misses # 0.00% of all branches
Time to fill with data: 0.357014
Fill rate with data: 280.101 Mints/sec, 1120.4Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 1.71661e-05
Time to fill with data: 0.358964
Fill rate with data: 278.579 Mints/sec, 1114.32Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.69277e-05
Time to fill with data: 0.356918
Fill rate with data: 280.177 Mints/sec, 1120.71Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.50204e-05
2.000779126 1000.703872 task-clock (msec)
2.000779126 1 context-switches # 0.001 K/sec
2.000779126 0 cpu-migrations # 0.000 K/sec
2.000779126 2,280 page-faults # 0.002 M/sec
2.000779126 2,686,072,244 cycles # 2.685 GHz
2.000779126 3,094,777,285 instructions # 1.16 insn per cycle
2.000779126 563,593,105 branches # 563.425 M/sec
2.000779126 9,661 branch-misses # 0.00% of all branches
Time to fill with data: 0.371785
Fill rate with data: 268.973 Mints/sec, 1075.89Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.90735e-05
Time to fill with data: 0.418562
Fill rate with data: 238.913 Mints/sec, 955.653Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 2.09808e-05
3.001146481 1000.436128 task-clock (msec)
3.001146481 1 context-switches # 0.001 K/sec
3.001146481 0 cpu-migrations # 0.000 K/sec
3.001146481 217,415 page-faults # 0.217 M/sec
3.001146481 2,687,783,783 cycles # 2.687 GHz
3.001146481 3,100,713,038 instructions # 1.16 insn per cycle
3.001146481 560,207,049 branches # 560.014 M/sec
3.001146481 83,230 branch-misses # 0.01% of all branches
Time to fill with data: 0.416297
Fill rate with data: 240.213 Mints/sec, 960.853Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.38283e-05
Time to fill with data: 0.41672
Fill rate with data: 239.969 Mints/sec, 959.877Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.40667e-05
Time to fill with data: 0.424997
Fill rate with data: 235.296 Mints/sec, 941.183Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.28746e-05
4.001467773 1000.378604 task-clock (msec)
4.001467773 2 context-switches # 0.002 K/sec
4.001467773 0 cpu-migrations # 0.000 K/sec
4.001467773 232,690 page-faults # 0.233 M/sec
4.001467773 2,655,313,682 cycles # 2.654 GHz
4.001467773 3,087,157,016 instructions # 1.15 insn per cycle
4.001467773 557,266,313 branches # 557.070 M/sec
4.001467773 95,433 branch-misses # 0.02% of all branches
Time to fill with data: 0.413271
Fill rate with data: 241.972 Mints/sec, 967.888Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 1.21593e-05
Time to fill with data: 0.414624
Fill rate with data: 241.182 Mints/sec, 964.73Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.5974e-05
5.001792272 1000.372602 task-clock (msec)
5.001792272 2 context-switches # 0.002 K/sec
5.001792272 0 cpu-migrations # 0.000 K/sec
5.001792272 236,260 page-faults # 0.236 M/sec
5.001792272 2,687,340,230 cycles # 2.686 GHz
5.001792272 3,134,864,968 instructions # 1.17 insn per cycle
5.001792272 565,846,287 branches # 565.644 M/sec
5.001792272 104,634 branch-misses # 0.02% of all branches
Time to fill with data: 0.412331
Fill rate with data: 242.524 Mints/sec, 970.094Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.3113e-05
Time to fill with data: 0.414433
Fill rate with data: 241.294 Mints/sec, 965.174Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.88351e-05
Time to fill with data: 0.417277
Fill rate with data: 239.649 Mints/sec, 958.596Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
6.002129544 1000.404270 task-clock (msec)
6.002129544 1 context-switches # 0.001 K/sec
6.002129544 0 cpu-migrations # 0.000 K/sec
6.002129544 215,269 page-faults # 0.215 M/sec
6.002129544 2,676,269,667 cycles # 2.675 GHz
6.002129544 3,286,469,282 instructions # 1.23 insn per cycle
6.002129544 578,367,266 branches # 578.156 M/sec
6.002129544 345,470 branch-misses # 0.06% of all branches
....
After disabling THP with the root command from https://access.redhat.com/solutions/46111, I always get ~200 thousand page faults per second and about 950 MB/s:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ perf stat -I1000 ./a.out
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.422322
Fill rate with data: 236.786 Mints/sec, 947.145Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 1.50204e-05
Time to fill with data: 0.415068
Fill rate with data: 240.924 Mints/sec, 963.698Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.19345e-05
# time counts unit events
1.000162191 999.429856 task-clock (msec)
1.000162191 14 context-switches # 0.014 K/sec
1.000162191 0 cpu-migrations # 0.000 K/sec
1.000162191 232,727 page-faults # 0.233 M/sec
1.000162191 2,664,896,604 cycles # 2.666 GHz
1.000162191 3,080,713,267 instructions # 1.16 insn per cycle
1.000162191 555,116,838 branches # 555.434 M/sec
1.000162191 102,262 branch-misses # 0.02% of all branches
Time to fill with data: 0.440695
Fill rate with data: 226.914 Mints/sec, 907.658Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.09808e-05
Time to fill with data: 0.414463
Fill rate with data: 241.276 Mints/sec, 965.104Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 1.81198e-05
2.000544564 1000.142465 task-clock (msec)
2.000544564 16 context-switches # 0.016 K/sec
2.000544564 0 cpu-migrations # 0.000 K/sec
2.000544564 229,697 page-faults # 0.230 M/sec
2.000544564 2,621,180,984 cycles # 2.622 GHz
2.000544564 3,041,358,811 instructions # 1.15 insn per cycle
2.000544564 547,910,242 branches # 548.027 M/sec
2.000544564 93,682 branch-misses # 0.02% of all branches
Time to fill with data: 0.428383
Fill rate with data: 233.436 Mints/sec, 933.744Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 1.5974e-05
Time to fill with data: 0.421986
Fill rate with data: 236.975 Mints/sec, 947.899Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 1.5974e-05
Time to fill with data: 0.413477
Fill rate with data: 241.851 Mints/sec, 967.406Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 1.88351e-05
3.000866438 999.980461 task-clock (msec)
3.000866438 20 context-switches # 0.020 K/sec
3.000866438 0 cpu-migrations # 0.000 K/sec
3.000866438 231,194 page-faults # 0.231 M/sec
3.000866438 2,622,484,960 cycles # 2.623 GHz
3.000866438 3,061,610,229 instructions # 1.16 insn per cycle
3.000866438 551,533,361 branches # 551.616 M/sec
3.000866438 104,561 branch-misses # 0.02% of all branches
Time to fill with data: 0.448333
Fill rate with data: 223.048 Mints/sec, 892.194Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 1.50204e-05
Time to fill with data: 0.410566
Fill rate with data: 243.566 Mints/sec, 974.265Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 1.3113e-05
4.001231042 1000.098860 task-clock (msec)
4.001231042 17 context-switches # 0.017 K/sec
4.001231042 0 cpu-migrations # 0.000 K/sec
4.001231042 228,532 page-faults # 0.229 M/sec
4.001231042 2,586,146,024 cycles # 2.586 GHz
4.001231042 3,026,679,955 instructions # 1.15 insn per cycle
4.001231042 545,236,541 branches # 545.284 M/sec
4.001231042 115,251 branch-misses # 0.02% of all branches
Time to fill with data: 0.441442
Fill rate with data: 226.53 Mints/sec, 906.121Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 1.5974e-05
Time to fill with data: 0.42898
Fill rate with data: 233.111 Mints/sec, 932.445Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.00272e-05
5.001547227 999.982415 task-clock (msec)
5.001547227 19 context-switches # 0.019 K/sec
5.001547227 0 cpu-migrations # 0.000 K/sec
5.001547227 225,796 page-faults # 0.226 M/sec
5.001547227 2,560,990,918 cycles # 2.561 GHz
5.001547227 3,005,384,743 instructions # 1.15 insn per cycle
5.001547227 542,275,580 branches # 542.315 M/sec
5.001547227 116,537 branch-misses # 0.02% of all branches
Time to fill with data: 0.414212
Fill rate with data: 241.422 Mints/sec, 965.689Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 1.69277e-05
Time to fill with data: 0.411084
Fill rate with data: 243.259 Mints/sec, 973.037Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 1.40667e-05
Time to fill with data: 0.413644
Fill rate with data: 241.754 Mints/sec, 967.015Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 1.28746e-05
6.001849796 999.913923 task-clock (msec)
6.001849796 18 context-switches # 0.018 K/sec
6.001849796 0 cpu-migrations # 0.000 K/sec
6.001849796 236,912 page-faults # 0.237 M/sec
6.001849796 2,685,445,660 cycles # 2.686 GHz
6.001849796 3,153,464,551 instructions # 1.20 insn per cycle
6.001849796 568,989,467 branches # 569.032 M/sec
6.001849796 125,943 branch-misses # 0.02% of all branches
Time to fill with data: 0.444891
Fill rate with data: 224.774 Mints/sec, 899.097Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
The test, modified to print rates and to run a limited number of iterations for perf stat:
$ cat test.c; g++ test.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
    struct timeval time;
    if (gettimeofday(&time, NULL))
    {
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
    int *a;
    int n = 100000000;
    int j;
    double total = 0;
    for(j=0; j<15; j++)
    {
        cout << "Iteration " << j << endl;
        double start = getWallTime();
        a = new int[n];
        cout << "Time to malloc: " << getWallTime() - start << endl;
        for (int i = 0; i < n; i++)
        {
            a[i] = 1;
        }
        double elapsed = getWallTime()-start;
        cout << "Time to fill with data: " << elapsed << endl;
        cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        total += n*sizeof(int)*1./M;
        cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
    }
    return 0;
}
The test, modified to time a second and a third write access:
$ g++ second.c -o second
$ cat second.c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <vector>
using namespace std;

double getWallTime()
{
    struct timeval time;
    if (gettimeofday(&time, NULL))
    {
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

#define M 1000000

int main()
{
    int *a;
    int n = 100000000;
    int j;
    double total = 0;
    for(j=0; j<15; j++)
    {
        cout << "Iteration " << j << endl;
        double start = getWallTime();
        a = new int[n];
        cout << "Time to malloc: " << getWallTime() - start << endl;
        for (int i = 0; i < n; i++)
        {
            a[i] = 1;
        }
        double elapsed = getWallTime()-start;
        cout << "Time to fill with data: " << elapsed << endl;
        cout << "Fill rate with data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        start = getWallTime();
        for (int i = 0; i < n; i++)
        {
            a[i] = 2;
        }
        elapsed = getWallTime()-start;
        cout << "Time to second write access of data: " << elapsed << endl;
        cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        start = getWallTime();
        for (int i = 0; i < n; i++)
        {
            a[i] = 3;
        }
        elapsed = getWallTime()-start;
        cout << "Time to third write access of data: " << elapsed << endl;
        cout << "Access rate of data: " << n/elapsed/M << " Mints/sec, " << n*sizeof(int)/elapsed/M << "Mbytes/sec" << endl;
        total += n*sizeof(int)*1./M;
        cout << "Allocated " << n*sizeof(int)*1./M << " Mbytes, with total memory allocated " << total << "Mbytes" << endl;
    }
    return 0;
}
Without THP - the second and third accesses run at about 1.25 GB/s:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ ./second
Iteration 0
Time to malloc: 9.05991e-06
Time to fill with data: 0.426387
Fill rate with data: 234.529 Mints/sec, 938.115Mbytes/sec
Time to second write access of data: 0.318292
Access rate of data: 314.177 Mints/sec, 1256.71Mbytes/sec
Time to third write access of data: 0.321722
Access rate of data: 310.827 Mints/sec, 1243.31Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
Iteration 1
Time to malloc: 3.50475e-05
Time to fill with data: 0.411859
Fill rate with data: 242.802 Mints/sec, 971.206Mbytes/sec
Time to second write access of data: 0.317989
Access rate of data: 314.476 Mints/sec, 1257.91Mbytes/sec
Time to third write access of data: 0.321637
Access rate of data: 310.91 Mints/sec, 1243.64Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 800Mbytes
Iteration 2
Time to malloc: 2.81334e-05
Time to fill with data: 0.411918
Fill rate with data: 242.767 Mints/sec, 971.067Mbytes/sec
Time to second write access of data: 0.318647
Access rate of data: 313.827 Mints/sec, 1255.31Mbytes/sec
Time to third write access of data: 0.321041
Access rate of data: 311.487 Mints/sec, 1245.95Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1200Mbytes
Iteration 3
Time to malloc: 2.5034e-05
Time to fill with data: 0.411138
Fill rate with data: 243.227 Mints/sec, 972.909Mbytes/sec
Time to second write access of data: 0.318429
Access rate of data: 314.042 Mints/sec, 1256.17Mbytes/sec
Time to third write access of data: 0.321332
Access rate of data: 311.205 Mints/sec, 1244.82Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 1600Mbytes
Iteration 4
Time to malloc: 3.71933e-05
Time to fill with data: 0.410922
Fill rate with data: 243.355 Mints/sec, 973.421Mbytes/sec
Time to second write access of data: 0.320262
Access rate of data: 312.244 Mints/sec, 1248.98Mbytes/sec
Time to third write access of data: 0.319223
Access rate of data: 313.261 Mints/sec, 1253.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2000Mbytes
Iteration 5
Time to malloc: 2.19345e-05
Time to fill with data: 0.418508
Fill rate with data: 238.944 Mints/sec, 955.777Mbytes/sec
Time to second write access of data: 0.320419
Access rate of data: 312.092 Mints/sec, 1248.37Mbytes/sec
Time to third write access of data: 0.319752
Access rate of data: 312.742 Mints/sec, 1250.97Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2400Mbytes
Iteration 6
Time to malloc: 3.19481e-05
Time to fill with data: 0.410054
Fill rate with data: 243.87 Mints/sec, 975.481Mbytes/sec
Time to second write access of data: 0.320244
Access rate of data: 312.262 Mints/sec, 1249.05Mbytes/sec
Time to third write access of data: 0.319546
Access rate of data: 312.944 Mints/sec, 1251.78Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 2800Mbytes
Iteration 7
Time to malloc: 3.19481e-05
Time to fill with data: 0.409491
Fill rate with data: 244.206 Mints/sec, 976.822Mbytes/sec
Time to second write access of data: 0.318501
Access rate of data: 313.971 Mints/sec, 1255.88Mbytes/sec
Time to third write access of data: 0.320052
Access rate of data: 312.449 Mints/sec, 1249.8Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3200Mbytes
Iteration 8
Time to malloc: 2.5034e-05
Time to fill with data: 0.409922
Fill rate with data: 243.949 Mints/sec, 975.795Mbytes/sec
Time to second write access of data: 0.320583
Access rate of data: 311.932 Mints/sec, 1247.73Mbytes/sec
Time to third write access of data: 0.319478
Access rate of data: 313.011 Mints/sec, 1252.04Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 3600Mbytes
Iteration 9
Time to malloc: 2.69413e-05
Time to fill with data: 0.41104
Fill rate with data: 243.285 Mints/sec, 973.141Mbytes/sec
Time to second write access of data: 0.320389
Access rate of data: 312.121 Mints/sec, 1248.48Mbytes/sec
Time to third write access of data: 0.319762
Access rate of data: 312.733 Mints/sec, 1250.93Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4000Mbytes
Iteration 10
Time to malloc: 2.59876e-05
Time to fill with data: 0.412612
Fill rate with data: 242.358 Mints/sec, 969.434Mbytes/sec
Time to second write access of data: 0.318304
Access rate of data: 314.165 Mints/sec, 1256.66Mbytes/sec
Time to third write access of data: 0.319453
Access rate of data: 313.035 Mints/sec, 1252.14Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4400Mbytes
Iteration 11
Time to malloc: 2.98023e-05
Time to fill with data: 0.412428
Fill rate with data: 242.467 Mints/sec, 969.866Mbytes/sec
Time to second write access of data: 0.318467
Access rate of data: 314.004 Mints/sec, 1256.02Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 4800Mbytes
Iteration 12
Time to malloc: 2.69413e-05
Time to fill with data: 0.410515
Fill rate with data: 243.597 Mints/sec, 974.386Mbytes/sec
Time to second write access of data: 0.31832
Access rate of data: 314.149 Mints/sec, 1256.6Mbytes/sec
Time to third write access of data: 0.319569
Access rate of data: 312.921 Mints/sec, 1251.69Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5200Mbytes
Iteration 13
Time to malloc: 2.28882e-05
Time to fill with data: 0.412385
Fill rate with data: 242.492 Mints/sec, 969.967Mbytes/sec
Time to second write access of data: 0.318929
Access rate of data: 313.549 Mints/sec, 1254.2Mbytes/sec
Time to third write access of data: 0.31949
Access rate of data: 312.999 Mints/sec, 1252Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 5600Mbytes
Iteration 14
Time to malloc: 2.90871e-05
Time to fill with data: 0.41235
Fill rate with data: 242.512 Mints/sec, 970.05Mbytes/sec
Time to second write access of data: 0.340456
Access rate of data: 293.724 Mints/sec, 1174.89Mbytes/sec
Time to third write access of data: 0.319716
Access rate of data: 312.778 Mints/sec, 1251.11Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes
With THP - the first fill is faster, but the second and third accesses run at the same speed:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ ./second
Iteration 0
Time to malloc: 1.50204e-05
Time to fill with data: 0.365043
Fill rate with data: 273.94 Mints/sec, 1095.76Mbytes/sec
Time to second write access of data: 0.320503
Access rate of data: 312.01 Mints/sec, 1248.04Mbytes/sec
Time to third write access of data: 0.319442
Access rate of data: 313.046 Mints/sec, 1252.18Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 400Mbytes
...
Iteration 14
Time to malloc: 2.7895e-05
Time to fill with data: 0.409294
Fill rate with data: 244.323 Mints/sec, 977.293Mbytes/sec
Time to second write access of data: 0.318422
Access rate of data: 314.049 Mints/sec, 1256.19Mbytes/sec
Time to third write access of data: 0.322098
Access rate of data: 310.465 Mints/sec, 1241.86Mbytes/sec
Allocated 400 Mbytes, with total memory allocated 6000Mbytes