I am running Hadoop 2.2 on Red Hat 6.3-6.5, and all of my machines crash after running for a while. /var/log/messages
repeatedly shows:
Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: CPU 1
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel:
Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G W --------------- 2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>] [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8 EFLAGS: 00000202
Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
Aug 11 06:30:42 jn4_73_128 kernel: FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
Aug 11 06:30:42 jn4_73_128 kernel: Stack:
Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Eventually the machine crashes. Opening the vmcore with crash:
crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore
crash 6.1.0-5.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
please wait... (determining panic task)
WARNING: active task ffff881071850040 on cpu 12 not found in PID hash
KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Sun Aug 10 09:47:32 2014
UPTIME: 7 days, 16:00:19
LOAD AVERAGE: 11.01, 3.11, 1.08
TASKS: 724
NODENAME: master1.otocyon.com
RELEASE: 2.6.32-431.5.1.el6.x86_64
VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
MACHINE: x86_64 (1895 Mhz)
MEMORY: 64 GB
PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
PID: 23976
COMMAND: "sh"
TASK: ffff881071850aa0 [THREAD_INFO: ffff880a05c80000]
CPU: 0
STATE: TASK_INTERRUPTIBLE (PANIC)
crash> bt
PID: 23976 TASK: ffff881071850aa0 CPU: 0 COMMAND: "sh"
#0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
#1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
#2 [ffff880028207c80] panic at ffffffff8152751a
#3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
#4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
#5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
#6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
#7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
#8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
#9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
#10 [ffff880028207ef0] notify_die at ffffffff810a153e
#11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
#12 [ffff880028207f50] nmi at ffffffff8152adb0
[exception RIP: task_rq_unlock_wait+44]
RIP: ffffffff810534fc RSP: ffff880a05c81dc8 RFLAGS: 00000016
RAX: 000000000ec70ebe RBX: ffff881071850040 RCX: ffff8800282d6840
RDX: 0000000000000ec7 RSI: 0000000000000000 RDI: ffff881071850040
RBP: ffff880a05c81dc8 R8: dead000000200200 R9: dead000000200200
R10: ffff8810734a42d0 R11: 0000000000000246 R12: 00000000000114b8
R13: ffff8810734a4180 R14: ffff881071fd3440 R15: ffff881071fd3c48
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff880a05c81dc8] task_rq_unlock_wait at ffffffff810534fc
#14 [ffff880a05c81dd0] release_task at ffffffff81075454
#15 [ffff880a05c81e10] wait_consider_task at ffffffff81075fb6
#16 [ffff880a05c81e80] do_wait at ffffffff810763e6
#17 [ffff880a05c81ee0] sys_wait4 at ffffffff810765d3
#18 [ffff880a05c81f80] system_call_fastpath at ffffffff8100b072
RIP: 0000003e1a2ac8be RSP: 00007fffa58c6330 RFLAGS: 00010207
RAX: 000000000000003d RBX: ffffffff8100b072 RCX: 0000003e1a232be0
RDX: 0000000000000000 RSI: 00007fffa58c62ec RDI: ffffffffffffffff
RBP: 00000000ffffffff R8: 000000000203b8d0 R9: 000000000203d590
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000005d00
ORIG_RAX: 000000000000003d CS: 0033 SS: 002b
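This happens on machines from different vendors, and I have tried updating to the latest kernel from Red Hat. Can anyone who has run into the same problem help?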