Understanding a kernel panic / advice

Time: 2017-01-21 09:52:43

Tags: linux-kernel

I am running Debian jessie with kernel 3.16.39-1:

# apt-cache policy linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:
  Installed: 3.16.39-1
  Candidate: 3.16.39-1
  Version table:
 *** 3.16.39-1 0
        500 http://ftp.fr.debian.org/debian/ jessie/main amd64 Packages
        100 /var/lib/dpkg/status

This machine uses 2 bonded interfaces:

  • bond0: 2 × 10 Gb/s ixgbe X520
  • bond1: 2 × 10 Gb/s ixgbe X520

irqbalance is running on this machine.

Under network load (12 Gb/s on bond1), I got the following kernel panic:

kernel: [26339.017497] Call Trace:
kernel: [26339.017499]  <IRQ>  [<ffffffff81514c11>] ? dump_stack+0x5d/0x78
kernel: [26339.017509]  [<ffffffff81144a3f>] ? warn_alloc_failed+0xdf/0x130
kernel: [26339.017513]  [<ffffffff810a949d>] ? __wake_up_sync_key+0x3d/0x60
kernel: [26339.017515]  [<ffffffff81148daf>] ? __alloc_pages_nodemask+0x8ef/0xb50
kernel: [26339.017519]  [<ffffffff8147eaff>] ? tcp_v4_do_rcv+0x1af/0x4c0
kernel: [26339.017524]  [<ffffffff81455b66>] ? nf_hook_slow+0x76/0x130
kernel: [26339.017528]  [<ffffffff811883ad>] ? alloc_pages_current+0x9d/0x150
kernel: [26339.017531]  [<ffffffff81412d7b>] ? __netdev_alloc_frag+0x8b/0x140
kernel: [26339.017534]  [<ffffffff8141913f>] ? __netdev_alloc_skb+0x6f/0xf0
kernel: [26339.017558]  [<ffffffffa0146a0d>] ? ixgbe_clean_rx_irq+0x10d/0xb70 [ixgbe]
kernel: [26339.017564]  [<ffffffffa0148198>] ? ixgbe_poll+0x488/0x860 [ixgbe]
kernel: [26339.017567]  [<ffffffff8108c9ad>] ? hrtimer_get_next_event+0xad/0xc0
kernel: [26339.017570]  [<ffffffff81425509>] ? net_rx_action+0x129/0x250
kernel: [26339.017573]  [<ffffffff8106d911>] ? __do_softirq+0xf1/0x2d0
kernel: [26339.017575]  [<ffffffff8106dd25>] ? irq_exit+0x95/0xa0
kernel: [26339.017578]  [<ffffffff8151dbe2>] ? do_IRQ+0x52/0xe0
kernel: [26339.017582]  [<ffffffff8151ba2d>] ? common_interrupt+0x6d/0x6d
kernel: [26339.017583]  <EOI>  [<ffffffff8108c31d>] ? __hrtimer_start_range_ns+0x1cd/0x3a0
kernel: [26339.017588]  [<ffffffff813e32a2>] ? cpuidle_enter_state+0x52/0xc0
kernel: [26339.017590]  [<ffffffff813e3298>] ? cpuidle_enter_state+0x48/0xc0
kernel: [26339.017592]  [<ffffffff810a9b28>] ? cpu_startup_entry+0x328/0x470
kernel: [26339.017595]  [<ffffffff81043fdf>] ? start_secondary+0x20f/0x2d0
[....]
kernel: [26339.017647] swapper/13: page allocation failure: order:0, mode:0x20
kernel: [26339.017667] active_anon:2860787 inactive_anon:290478 isolated_anon:15723
kernel: [26339.017667]  active_file:284318 inactive_file:151176 isolated_file:0
kernel: [26339.017667]  unevictable:20736 dirty:24804 writeback:4297 unstable:0
kernel: [26339.017667]  free:23079 slab_reclaimable:27293 slab_unreclaimable:86672
kernel: [26339.017667]  mapped:22343 shmem:413 pagetables:10111 bounce:0
kernel: [26339.017667]  free_cma:0
kernel: [26339.017670] Node 0 DMA free:15896kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
kernel: [26339.017675] lowmem_reserve[]: 0 3191 16016 16016
kernel: [26339.017680] Node 0 DMA32 free:56312kB min:13456kB low:16820kB high:20184kB active_anon:589468kB inactive_anon:141384kB active_file:1132312kB inactive_file:597576kB unevictable:16616kB isolated(anon):0kB isolated(file):0kB present:3345344kB managed:3270860kB mlocked:16616kB dirty:33860kB writeback:4288kB mapped:18616kB shmem:180kB slab_reclaimable:17036kB slab_unreclaimable:83696kB kernel_stack:34016kB pagetables:8384kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
kernel: [26339.017686] lowmem_reserve[]: 0 0 12824 12824
kernel: [26339.017691] Node 0 Normal free:20108kB min:54060kB low:67572kB high:81088kB active_anon:10853680kB inactive_anon:1020528kB active_file:4960kB inactive_file:7128kB unevictable:66328kB isolated(anon):62892kB isolated(file):0kB present:13369344kB managed:13131968kB mlocked:66328kB dirty:65356kB writeback:12900kB mapped:70756kB shmem:1472kB slab_reclaimable:92136kB slab_unreclaimable:262992kB kernel_stack:10880kB pagetables:32060kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4275 all_unreclaimable? no
kernel: [26339.017696] lowmem_reserve[]: 0 0 0 0
kernel: [26339.017701] Node 0 DMA: 0*4kB  0000000000000020 ffff88042f1a3bf0
kernel: [26339.017706] 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15896kB
kernel: [26339.017723] Node 0 DMA32: 250*4kB 
kernel: [26339.017726]  ffffffff81144a3f 0000000000000000 0000000000000000 ffffffff00000002
kernel: [26339.017730] (EM) 967*8kB (UEM) 2628*16kB (UM) 83*32kB (UMR) 15*64kB (R) 8*128kB (R) 4*256kB (R) 0*512kB 0*1024kB 0*2048kB <4>[26339.017747] swapper/0: page allocation failure: order:0, mode:0x20
kernel: [26339.017748] 0*4096kB = 56448kB
kernel: [26339.017751] Node 0 Normal: 3653*4kB (M) 0*8kB 0*16kB 1*32kB (R) 0*64kB 1*128kB (R) 0*256kB 1*512kB (R) 0*1024kB 1*2048kB (R) 0*4096kB = 17332kB
kernel: [26339.017767] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
kernel: [26339.017768] 466495 total pagecache pages
kernel: [26339.017769] 10046 pages in swap cache
kernel: [26339.017771] Swap cache stats: add 4415081, delete 4405035, find 1682225/2488531
kernel: [26339.017772] Free swap  = 19301256kB
kernel: [26339.017773] Total swap = 19764220kB
kernel: [26339.017774] 4182667 pages RAM
kernel: [26339.017775] 0 pages HighMem/MovableOnly
kernel: [26339.017776] 59344 pages reserved
kernel: [26339.017777] 0 pages hwpoisoned

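The kernel panic shows messages related to IRQs and ixgbe.

Can someone give me some advice on how to fix this? The server had been running fine for 2 hours under the same network load, without any problem.

Regards,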

1 answer:

Answer 0: (score: 0)

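The call trace does not show any debugging information related to a kernel crash. It is the backtrace printed by warn_alloc_failed for a failed page allocation in the ixgbe receive path: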

kernel: [26339.017509]  [<ffffffff81144a3f>] ? warn_alloc_failed+0xdf/0x130
kernel: [26339.017513]  [<ffffffff810a949d>] ? __wake_up_sync_key+0x3d/0x60
kernel: [26339.017515]  [<ffffffff81148daf>] ? __alloc_pages_nodemask+0x8ef/0xb50
kernel: [26339.017519]  [<ffffffff8147eaff>] ? tcp_v4_do_rcv+0x1af/0x4c0
kernel: [26339.017524]  [<ffffffff81455b66>] ? nf_hook_slow+0x76/0x130
kernel: [26339.017528]  [<ffffffff811883ad>] ? alloc_pages_current+0x9d/0x150
kernel: [26339.017531]  [<ffffffff81412d7b>] ? __netdev_alloc_frag+0x8b/0x140
kernel: [26339.017534]  [<ffffffff8141913f>] ? __netdev_alloc_skb+0x6f/0xf0
kernel: [26339.017558]  [<ffffffffa0146a0d>] ? ixgbe_clean_rx_irq+0x10d/0xb70 [ixgbe]
kernel: [26339.017564]  [<ffffffffa0148198>] ? ixgbe_poll+0x488/0x860 [ixgbe]
kernel: [26339.017567]  [<ffffffff8108c9ad>] ? hrtimer_get_next_event+0xad/0xc0

Rather than the call trace above, it is the signature below that points to page starvation.

kernel: [26339.017667] active_anon:2860787 inactive_anon:290478 isolated_anon:15723
kernel: [26339.017667]  active_file:284318 inactive_file:151176 isolated_file:0
kernel: [26339.017667]  unevictable:20736 dirty:24804 writeback:4297 unstable:0
kernel: [26339.017667]  free:23079 slab_reclaimable:27293 slab_unreclaimable:86672
kernel: [26339.017667]  mapped:22343 shmem:413 pagetables:10111 bounce:0
kernel: [26339.017667]  free_cma:0
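
To see which zone is actually short of pages, the per-zone `free` and `min` values from the full report in the question can be compared. The following is a minimal sketch, not part of the original answer; the function name `zone_pressure` and its output format are made up for illustration. It reads kernel log lines on standard input and flags every zone whose `free` value is below its `min` watermark.

```shell
#!/bin/sh
# Hypothetical helper (name and output format are assumptions for this sketch).
# Reads kernel log lines on stdin and, for each per-zone summary line
# ("Node N <zone> free:XkB min:YkB ..."), prints whether free is below min.
zone_pressure() {
    awk '
    {
        zone = ""; free = -1; min = -1
        for (i = 1; i <= NF; i++) {
            if ($i == "Node" && i + 2 <= NF) {
                zone = $(i + 2)                 # zone name follows "Node N"
            } else if ($i ~ /^free:[0-9]+kB$/) {
                v = $i; sub(/^free:/, "", v); sub(/kB$/, "", v); free = v + 0
            } else if ($i ~ /^min:[0-9]+kB$/) {
                v = $i; sub(/^min:/, "", v); sub(/kB$/, "", v); min = v + 0
            }
        }
        # Only lines that carry a zone name plus both values are reported.
        if (zone != "" && free >= 0 && min >= 0)
            printf "%s free=%dkB min=%dkB %s\n", zone, free, min,
                   (free < min ? "BELOW_MIN" : "ok")
    }'
}

# Typical use: dmesg | zone_pressure
```

Fed the three zone lines from the question, this marks Normal (free:20108kB, min:54060kB) as below its watermark, while DMA and DMA32 are above theirs.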

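As the "inactive_anon:290478, inactive_file:151176" signature suggests, page starvation in the DMA zone is quite likely. If you follow the steps below, you will find out whether your system is suffering from a kernel memory leak.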

  1. Kernel: add the kmemleak-related configuration options, for example:

    diff --git a/arch/arm/configs/pompeii_defconfig b/arch/arm/configs/pompeii_defconfig
    index 2e97f97..aac678a 100644
    --- a/arch/arm/configs/pompeii_defconfig
    +++ b/arch/arm/configs/pompeii_defconfig
    @@ -754,8 +754,8 @@
     CONFIG_SLUB_DEBUG_PANIC_ON=y
     CONFIG_SLUB_DEBUG_ON=y
     CONFIG_DEBUG_KMEMLEAK=y
    -CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=4000
    -CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y
    +CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=40000
    +# CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
     CONFIG_DEBUG_STACK_USAGE=y
     CONFIG_DEBUG_VM=y
     CONFIG_DEBUG_MEMORY_INIT=y

  2. Make sure the kernel command line contains "kmemleak=on".

  3. After 10 minutes, run the following command:
     echo scan > /sys/kernel/debug/kmemleak

  4. The kernel memory leak report can then be displayed with:
     cat /sys/kernel/debug/kmemleak
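
The scan and report steps above can be wrapped in a small script. This is a minimal sketch under stated assumptions: the function name `kmemleak_scan_and_report` and the `KMEMLEAK` override variable are invented for illustration, the kernel is assumed to have been built with CONFIG_DEBUG_KMEMLEAK, and debugfs is assumed to be mounted at /sys/kernel/debug.

```shell
#!/bin/sh
# Hypothetical wrapper around the kmemleak debugfs file.
# KMEMLEAK may be overridden (assumption for this sketch); the default is the
# path used in the steps above. If debugfs is not mounted yet, it can be
# mounted first with: mount -t debugfs nodev /sys/kernel/debug
KMEMLEAK="${KMEMLEAK:-/sys/kernel/debug/kmemleak}"

kmemleak_scan_and_report() {
    if [ ! -e "$KMEMLEAK" ]; then
        # Either the kernel lacks kmemleak support or debugfs is not mounted.
        echo "kmemleak not available at $KMEMLEAK" >&2
        return 1
    fi
    echo scan > "$KMEMLEAK" || return 1   # trigger a memory scan
    cat "$KMEMLEAK"                       # print the suspected leaks, if any
}
```

Run as root, a call such as `kmemleak_scan_and_report > /tmp/kmemleak.txt` keeps the report for later comparison. Whether the running kernel was built with the option can usually be checked beforehand with `grep KMEMLEAK /boot/config-$(uname -r)`, assuming the distribution installs its kernel config under /boot.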