Hadoop causes system crashes with "soft lockup" and "hard lockup"

Time: 2014-08-11 12:02:33

Tags: hadoop crash centos rhel

I am running Hadoop 2.2 on Red Hat 6.3 through 6.5, and every one of my machines crashes after running for a while. /var/log/messages repeatedly shows:

Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: CPU 1 
Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 11 06:30:42 jn4_73_128 kernel: 
Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G        W  ---------------    2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>]  [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8  EFLAGS: 00000202
Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
Aug 11 06:30:42 jn4_73_128 kernel: FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
Aug 11 06:30:42 jn4_73_128 kernel: Stack:
Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00 
Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
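
Since these reports repeat, it may help to tally which process and CPU each one names. Below is a minimal Python 3 sketch for that. The function name `summarize_soft_lockups` and the regular expression are assumptions derived only from the message format shown above; they are not part of Hadoop or of any kernel tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
# The process name may itself contain colons, so the greedy ".+" lets the
# last ":<digits>]" be taken as the PID.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Count soft lockup reports per (process name, CPU number) pair."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[(match.group("comm"), int(match.group("cpu")))] += 1
    return counts


# Two lines taken from the log above; only the first is a lockup report.
sample = [
    "Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]",
    "Aug 11 06:30:42 jn4_73_128 kernel: CPU 1 ",
]
print(summarize_soft_lockups(sample))
```

To run it over a whole log, an open file object can be passed in place of `sample`, since the function only iterates over lines.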

Eventually the machine crashes:

crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux  /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore

crash 6.1.0-5.el6
Copyright (C) 2002-2012  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

please wait... (determining panic task)         
WARNING: active task ffff881071850040 on cpu 12 not found in PID hash

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
    DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Sun Aug 10 09:47:32 2014
      UPTIME: 7 days, 16:00:19
LOAD AVERAGE: 11.01, 3.11, 1.08
       TASKS: 724
    NODENAME: master1.otocyon.com
     RELEASE: 2.6.32-431.5.1.el6.x86_64
     VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
     MACHINE: x86_64  (1895 Mhz)
      MEMORY: 64 GB
       PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
         PID: 23976
     COMMAND: "sh"
        TASK: ffff881071850aa0  [THREAD_INFO: ffff880a05c80000]
         CPU: 0
       STATE: TASK_INTERRUPTIBLE (PANIC)

crash> bt
PID: 23976  TASK: ffff881071850aa0  CPU: 0   COMMAND: "sh"
 #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
 #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
 #2 [ffff880028207c80] panic at ffffffff8152751a
 #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
 #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
 #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
 #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
 #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
 #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
 #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
#10 [ffff880028207ef0] notify_die at ffffffff810a153e
#11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
#12 [ffff880028207f50] nmi at ffffffff8152adb0
    [exception RIP: task_rq_unlock_wait+44]
    RIP: ffffffff810534fc  RSP: ffff880a05c81dc8  RFLAGS: 00000016
    RAX: 000000000ec70ebe  RBX: ffff881071850040  RCX: ffff8800282d6840
    RDX: 0000000000000ec7  RSI: 0000000000000000  RDI: ffff881071850040
    RBP: ffff880a05c81dc8   R8: dead000000200200   R9: dead000000200200
    R10: ffff8810734a42d0  R11: 0000000000000246  R12: 00000000000114b8
    R13: ffff8810734a4180  R14: ffff881071fd3440  R15: ffff881071fd3c48
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#13 [ffff880a05c81dc8] task_rq_unlock_wait at ffffffff810534fc
#14 [ffff880a05c81dd0] release_task at ffffffff81075454
#15 [ffff880a05c81e10] wait_consider_task at ffffffff81075fb6
#16 [ffff880a05c81e80] do_wait at ffffffff810763e6
#17 [ffff880a05c81ee0] sys_wait4 at ffffffff810765d3
#18 [ffff880a05c81f80] system_call_fastpath at ffffffff8100b072
    RIP: 0000003e1a2ac8be  RSP: 00007fffa58c6330  RFLAGS: 00010207
    RAX: 000000000000003d  RBX: ffffffff8100b072  RCX: 0000003e1a232be0
    RDX: 0000000000000000  RSI: 00007fffa58c62ec  RDI: ffffffffffffffff
    RBP: 00000000ffffffff   R8: 000000000203b8d0   R9: 000000000203d590
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000005d00
    ORIG_RAX: 000000000000003d  CS: 0033  SS: 002b
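
In the backtrace above, the frames listed before the `--- <NMI exception stack> ---` marker belong to the watchdog's NMI handling, and the frames after it are the kernel path that was interrupted. A rough Python 3 sketch for pulling out just that interrupted path is shown below. The helper name `interrupted_path` and the regular expression are assumptions based on the `bt` output format shown here; they are not part of the crash utility.

```python
import re

# Matches crash "bt" frame lines such as:
#   #13 [ffff880a05c81dc8] task_rq_unlock_wait at ffffffff810534fc
FRAME = re.compile(r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def interrupted_path(bt_lines):
    """Return function names of frames listed after the NMI stack marker."""
    names = []
    after_nmi = False
    for line in bt_lines:
        if "<NMI exception stack>" in line:
            after_nmi = True
            continue
        match = FRAME.match(line)
        if after_nmi and match:
            names.append(match.group(2))
    return names


# A few lines copied from the backtrace above.
sample_bt = [
    "#11 [ffff880028207f20] do_nmi at ffffffff8152b4eb",
    "#12 [ffff880028207f50] nmi at ffffffff8152adb0",
    "    [exception RIP: task_rq_unlock_wait+44]",
    "--- <NMI exception stack> ---",
    "#13 [ffff880a05c81dc8] task_rq_unlock_wait at ffffffff810534fc",
    "#14 [ffff880a05c81dd0] release_task at ffffffff81075454",
    "#15 [ffff880a05c81e10] wait_consider_task at ffffffff81075fb6",
]
print(interrupted_path(sample_bt))
```

Running the same helper over backtraces from several vmcores would show whether the hard lockup is always hit on the same path.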

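This happens on machines from different vendors, and I have already tried updating to the latest kernel from Red Hat. Has anyone seen the same problem, and can you help?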

0 answers:

No answers.