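We have a web application that runs on AWS Elastic Beanstalk with a Java backend. Intermittently, one of the application servers behind the load balancer crashes with CPU usage at 100%. After some analysis, I have reason to believe the underlying operating system may be playing a role in this problem.
Here is a snippet from the /var/log/messages file showing where the kernel caught the error: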
Jan 3 22:50:11 ip-10-220-46-44 kernel: [22649903.665305] ------------[ cut here ]------------
Jan 3 22:50:11 ip-10-220-46-44 kernel: [22649903.676120] WARNING: CPU: 0 PID: 12578 at fs/dcache.c:361 d_shrink_del+0x71/0x80()
Jan 3 22:50:11 ip-10-220-46-44 kernel: [22649903.680845] Modules linked in: ipv6 binfmt_misc evbug evdev psmouse i2c_piix4 ixgbevf i2c_core button ext4 crc16 jbd2 mbcache dm_mirror dm_region_hash dm_log dm_mod
Jan 3 22:50:11 ip-10-220-46-44 kernel: [22649903.691804] CPU: 0 PID: 12578 Comm: lsof Not tainted 3.14.48-33.39.amzn1.x86_64 #1
Jan 3 22:50:11 ip-10-220-46-44 kernel: [22649903.696742] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.700879] 0000000000000009 ffff880037dc1c90 ffffffff81487535 0000000000000000
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.706058] ffff880037dc1cc8 ffffffff8105d39d ffff8800147b33c0 ffff880037dc1d48
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.711346] ffff8800147b3418 ffff8800147b33c0 ffff8800147b33c0 ffff880037dc1cd8
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.716450] Call Trace:
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.718151] [<ffffffff81487535>] dump_stack+0x45/0x56
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.721634] [<ffffffff8105d39d>] warn_slowpath_common+0x7d/0xa0
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.725658] [<ffffffff8105d47a>] warn_slowpath_null+0x1a/0x20
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.729562] [<ffffffff811bc601>] d_shrink_del+0x71/0x80
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.732957] [<ffffffff811bda24>] shrink_dentry_list+0x64/0xe0
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.736748] [<ffffffff811be868>] shrink_dcache_parent+0x28/0x70
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.740689] [<ffffffff8120fdf6>] proc_flush_task+0xa6/0x190
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.744318] [<ffffffff8105e190>] release_task+0x30/0x450
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.748305] [<ffffffff810946e1>] ? thread_group_cputime_adjusted+0x41/0x50
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.752971] [<ffffffff8105ee7b>] wait_consider_task+0x8cb/0xb00
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.757236] [<ffffffff8105f1b0>] do_wait+0x100/0x240
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.760681] [<ffffffff810602e4>] SyS_wait4+0x64/0xe0
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.764026] [<ffffffff8105def0>] ? task_stopped_code+0x60/0x60
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.768138] [<ffffffff81496fc9>] system_call_fastpath+0x16/0x1b
Jan 3 22:50:12 ip-10-220-46-44 kernel: [22649903.772147] ---[ end trace e3b4a7d896ae3cba ]---
Jan 3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054] INFO: rcu_sched self-detected stall on CPU
Jan 3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054] 0: (14747 ticks this GP) idle=f7d/140000000000001/0 softirq=400397332/400397344
Jan 3 22:51:11 ip-10-220-46-44 kernel: [22649962.676054] (t=14750 jiffies g=44792585 c=44792584 q=188767)
Roughly an hour and forty minutes later, the OOM killer stepped in:
Jan 4 00:30:55 ip-10-220-46-44 kernel: [22655946.486183] Out of memory: Kill process 2776 (java) score 654 or sacrifice child
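As you can see, the OOM killer selected the Java process and killed it. At that moment it was consuming approximately 1.5 GB of RAM, which is below the 2 GB maximum heap set via the -Xmx flag. Shortly afterwards, our web application restarted several times until the CPU reached 100%. The kernel version of the operating system is 3.14.48-33.39.amzn1.x86_64.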
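To line these events up in time, a minimal sketch like the following could pull the kernel warning, the RCU stall and the OOM kill out of /var/log/messages. The function name extract_events and the regular expressions are illustrative assumptions, written only against the log lines quoted above; other kernels or syslog formats may need different patterns.

```python
import re

# Hypothetical patterns, derived only from the log lines quoted in this post.
EVENT_PATTERNS = {
    "kernel_warning": re.compile(
        r"WARNING: CPU: \d+ PID: (?P<pid>\d+) at (?P<where>\S+) (?P<func>\S+)"
    ),
    "rcu_stall": re.compile(r"INFO: rcu_sched self-detected stall on CPU"),
    "oom_kill": re.compile(
        r"Out of memory: Kill process (?P<pid>\d+) \((?P<comm>[^)]+)\) score (?P<score>\d+)"
    ),
}

# Assumed syslog layout: "Jan 3 22:50:11 host kernel: [uptime] message"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$"
)


def extract_events(lines):
    """Return one dict per recognised kernel event, in file order."""
    events = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # not a kernel line in the assumed layout
        for kind, pattern in EVENT_PATTERNS.items():
            hit = pattern.search(parsed.group("msg"))
            if hit:
                events.append({
                    "kind": kind,
                    "stamp": parsed.group("stamp"),
                    "uptime": float(parsed.group("uptime")),
                    **hit.groupdict(),
                })
                break
    return events


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        for event in extract_events(log):
            print(event)
```

Subtracting the uptime values of two events gives the gap between them in seconds, which avoids any ambiguity from the syslog timestamp rolling over midnight.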
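My question is:
Based on the information above, do you think this problem is caused by the operating system? Looking at the log line CPU: 0 PID: 12578 Comm: lsof Not tainted 3.14.48-33.39.amzn1.x86_64 #1, I suspect there may be a problem with the lsof utility shipped with our current distribution.
Any help would be greatly appreciated.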
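One way to test the lsof suspicion would be to tally which command names appear in the "CPU: ... PID: ... Comm: ..." lines across all the warnings in the log. The sketch below is only an assumption about how that could be done; count_warning_commands is a hypothetical helper, and the pattern is based solely on the single line quoted above. Note that the Comm field names the task that was running when the warning fired, which is not by itself proof that this task caused it.

```python
import re
from collections import Counter

# Hypothetical pattern, based on the single "Comm:" line quoted in this post.
# It assumes the command name contains no spaces.
TASK_RE = re.compile(r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")


def count_warning_commands(lines):
    """Count how often each command name appears in kernel 'Comm:' lines."""
    counts = Counter()
    for line in lines:
        match = TASK_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        for comm, seen in count_warning_commands(log).most_common():
            print(comm, seen)
```

If every warning in the log names lsof, that would support the suspicion; a mix of unrelated commands would point more towards the kernel itself.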