Intermittent CPU spikes on an EC2 instance

Time: 2019-01-22 09:16:30

Tags: linux amazon-web-services

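We have a web application running on AWS Elastic Beanstalk with a Java backend. Intermittently, one of the application servers behind the load balancer crashes because CPU usage hits 100%. After some analysis, I have reason to believe that the underlying operating system may be playing a role in this problem.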

Here is an excerpt from /var/log/messages showing where the kernel logged the warning:

Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.665305] ------------[ cut here ]------------
Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.676120] WARNING: CPU: 0 PID: 12578 at fs/dcache.c:361 d_shrink_del+0x71/0x80()
Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.680845] Modules linked in: ipv6 binfmt_misc evbug evdev psmouse i2c_piix4 ixgbevf i2c_core button ext4 crc16 jbd2 mbcache dm_mirror dm_region_hash dm_log dm_mod
Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.691804] CPU: 0 PID: 12578 Comm: lsof Not tainted 3.14.48-33.39.amzn1.x86_64 #1
Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.696742] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.700879]  0000000000000009 ffff880037dc1c90 ffffffff81487535 0000000000000000
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.706058]  ffff880037dc1cc8 ffffffff8105d39d ffff8800147b33c0 ffff880037dc1d48
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.711346]  ffff8800147b3418 ffff8800147b33c0 ffff8800147b33c0 ffff880037dc1cd8
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.716450] Call Trace:
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.718151]  [<ffffffff81487535>] dump_stack+0x45/0x56
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.721634]  [<ffffffff8105d39d>] warn_slowpath_common+0x7d/0xa0
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.725658]  [<ffffffff8105d47a>] warn_slowpath_null+0x1a/0x20
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.729562]  [<ffffffff811bc601>] d_shrink_del+0x71/0x80
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.732957]  [<ffffffff811bda24>] shrink_dentry_list+0x64/0xe0
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.736748]  [<ffffffff811be868>] shrink_dcache_parent+0x28/0x70
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.740689]  [<ffffffff8120fdf6>] proc_flush_task+0xa6/0x190
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.744318]  [<ffffffff8105e190>] release_task+0x30/0x450
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.748305]  [<ffffffff810946e1>] ? thread_group_cputime_adjusted+0x41/0x50
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.752971]  [<ffffffff8105ee7b>] wait_consider_task+0x8cb/0xb00
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.757236]  [<ffffffff8105f1b0>] do_wait+0x100/0x240
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.760681]  [<ffffffff810602e4>] SyS_wait4+0x64/0xe0
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.764026]  [<ffffffff8105def0>] ? task_stopped_code+0x60/0x60
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.768138]  [<ffffffff81496fc9>] system_call_fastpath+0x16/0x1b
Jan  3 22:50:12 ip-10-220-46-44 kernel: [22649903.772147] ---[ end trace e3b4a7d896ae3cba ]---
Jan  3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054] INFO: rcu_sched self-detected stall on CPU
Jan  3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054]       0: (14747 ticks this GP) idle=f7d/140000000000001/0 softirq=400397332/400397344
Jan  3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054]        (t=14750 jiffies g=44792585 c=44792584 q=188767)

About an hour and 40 minutes later, the OOM killer kicked in:

Jan  4 00:30:55 ip-10-220-46-44 kernel: [22655946.486183] Out of memory: Kill process 2776 (java) score 654 or sacrifice child

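As you can see, the OOM killer selected the Java process and killed it. At that moment the process was using approximately 1.5 GB of RAM, which is below the 2 GB heap allocated via the -Xmx flag. Shortly afterwards, our web application restarted several times until the CPU reached 100%. The kernel version is 3.14.48-33.39.amzn1.x86_64.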

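My question is: based on the information above, do you think this problem is caused by the operating system? What caught my eye is the log line CPU: 0 PID: 12578 Comm: lsof Not tainted 3.14.48-33.39.amzn1.x86_64 #1, which makes me think the lsof utility shipped with our current distribution might be at fault.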
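To line up the events quoted above (the dcache warning, the process named in the Comm field, the RCU stall, and the OOM kill), a small log scanner can help. The following is only a minimal sketch: the names `PREFIX`, `PATTERNS`, `SAMPLE`, and `scan` are hypothetical, the regular expressions are written against the exact lines pasted in this post rather than any general kernel log format, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Syslog prefix as it appears in the excerpts:
# "Jan  3 22:50:11 host kernel: [uptime] message"
PREFIX = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \[\s*[\d.]+\]\s*(?P<msg>.*)$"
)

# One pattern per kind of line quoted in the question (not an exhaustive list).
PATTERNS = {
    "warning": re.compile(r"WARNING: CPU: \d+ PID: (?P<pid>\d+) at (?P<where>\S+)"),
    "comm": re.compile(r"CPU: \d+ PID: (?P<pid>\d+) Comm: (?P<comm>\S+)"),
    "rcu_stall": re.compile(r"rcu_sched self-detected stall"),
    "oom_kill": re.compile(r"Out of memory: Kill process (?P<pid>\d+) \((?P<comm>[^)]+)\)"),
}

# Four lines copied verbatim from the excerpts above.
SAMPLE = [
    "Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.676120] WARNING: CPU: 0 PID: 12578 at fs/dcache.c:361 d_shrink_del+0x71/0x80()",
    "Jan  3 22:50:11 ip-10-220-46-44 kernel: [22649903.691804] CPU: 0 PID: 12578 Comm: lsof Not tainted 3.14.48-33.39.amzn1.x86_64 #1",
    "Jan  3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054] INFO: rcu_sched self-detected stall on CPU",
    "Jan  4 00:30:55 ip-10-220-46-44 kernel: [22655946.486183] Out of memory: Kill process 2776 (java) score 654 or sacrifice child",
]


def scan(lines, year):
    """Return (timestamp, kind, fields) for each recognized kernel line."""
    events = []
    for line in lines:
        prefix = PREFIX.match(line)
        if prefix is None:
            continue  # not a kernel line in the expected syslog layout
        # Collapse the padded day of month ("Jan  3") before parsing.
        stamp = " ".join(prefix.group("ts").split())
        # %b expects English month abbreviations (C/English locale assumed).
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for kind, pattern in PATTERNS.items():
            hit = pattern.search(prefix.group("msg"))
            if hit:
                events.append((when, kind, hit.groupdict()))
                break
    return events


if __name__ == "__main__":
    # The year is an assumption; the log lines themselves do not include it.
    for when, kind, fields in scan(SAMPLE, year=2019):
        print(when.isoformat(), kind, fields)
```

To run it against the real log instead of the sample, something like `scan(open("/var/log/messages"), year=2019)` should work, given read permission on the file. Listing every warning together with its Comm field would show whether lsof is always the process that hits the dcache warning or whether other processes trigger it too.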

Any help would be greatly appreciated.

0 answers:

No answers yet