Soft lockup in a user process

Date: 2016-10-14 06:44:00

Tags: linux-kernel linux-device-driver watchdog softlock

I have a problem on a customer machine: a user-space process is hogging the processor (soft lockup), along with 2 kernel processes, and the dumped stack traces show the RIP at _ticket_spin_lock in all 3 processes.

As far as I know, "if a user-space process is responsible for a soft lockup, a line identifying the process by its pid is logged, followed by the contents of the various CPU registers, without any call trace", but in my case I also get a dumped stack trace for the user process.

Does this come from a misbehaving user-space application? Is this normal soft-lockup behaviour? And if it is, how do I go about tracking down the problem?

Any help would be highly appreciated.

This is an x86_64 machine running kernel 3.1.10. I know that all 3 processes are waiting on _ticket_spin_lock. See:

Aug 26 09:31:58 at-vie01a-cq21b kernel: [115452.492033] BUG: soft lockup - CPU#3 stuck for 22s! [virtio_shm/5/3:7874]
Aug 26 09:32:00 at-vie01a-cq21b kernel: [115455.404215] BUG: soft lockup - CPU#31 stuck for 23s! [kni_thread:6605]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172014] BUG: soft lockup - CPU#0 stuck for 22s! [gis:14145]

Here gis is my user-space process, but it comes with a call trace.

Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172014] BUG: soft lockup - CPU#0 stuck for 22s! [gis:14145]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172017] Modules linked in: xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 crypto_null af_key iptable_mangle ip_set arptable_filter arp_tables iptable_raw iptable_nat nfnetlink_log nfnetlink ipt_ULOG ipt_PORTMAP af_packet zlib zlib_deflate sha512_generic sha256_generic sha1_generic md5 icp_qa_al pcie8120 rte_kni pfe_pep virtio_rte virtio_shm virtio_vtnet virtio_uio igb_uio virtio_ring virtio uio xt_tcpudp xt_state xt_pkttype nf_conntrack_control bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf ipmi_devintf ipmi_si ipmi_msghandler edd nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage uas iTCO_wdt ioatdma pcspkr iTCO_vendor_support ixgbe igb wmi i2c_i801 mdio dca sg button container ipv6 autofs4 usbhid ehci_hcd megasr(P) usbcore processor thermal_sys
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172098] CPU 0
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172099] Modules linked in: xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 crypto_null af_key iptable_mangle ip_set arptable_filter arp_tables iptable_raw iptable_nat nfnetlink_log nfnetlink ipt_ULOG ipt_PORTMAP af_packet zlib zlib_deflate sha512_generic sha256_generic sha1_generic md5 icp_qa_al pcie8120 rte_kni pfe_pep virtio_rte virtio_shm virtio_vtnet virtio_uio igb_uio virtio_ring virtio uio xt_tcpudp xt_state xt_pkttype nf_conntrack_control bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf ipmi_devintf ipmi_si ipmi_msghandler edd nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage uas iTCO_wdt ioatdma pcspkr iTCO_vendor_support ixgbe igb wmi i2c_i801 mdio dca sg button container ipv6 autofs4 usbhid ehci_hcd megasr(P) usbcore processor thermal_sys
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172163]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172166] Pid: 14145, comm: gis Tainted: P 3.1.10-gb20-default #1 Intel Corporation S2600CO/S2600CO
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172170] RIP: 0010:[<ffffffff8102064d>] [<ffffffff8102064d>] __ticket_spin_lock+0x15/0x1b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172178] RSP: 0000:ffff88043ee03cf0 EFLAGS: 00000293
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172180] RAX: 00000000000069bf RBX: 00000000020110ac RCX: 000000000000000e
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172182] RDX: 00000000000069bc RSI: 000000000000000e RDI: ffff88041e56a484
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172184] RBP: ffff88041e56a484 R08: ffff88041e56a740 R09: ffff8804154a5840
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172187] R10: 00007f0afce77000 R11: 0000000000000000 R12: ffff88043ee03c68
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172189] R13: ffffffff813f831e R14: ffff88041e56a484 R15: ffff88041e568280
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172192] FS: 00007f0afd70b700(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172194] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172196] CR2: 00007f54f6b88098 CR3: 000000042427e000 CR4: 00000000000406f0
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172199] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172201] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172204] Process gis (pid: 14145, threadinfo ffff88037537e000, task ffff88036a8fe180)
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172205] Stack:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172207] ffffffff8106b766 ffffffffa05e3a1e 0000000101b72e68 ffff8808260ae680
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172213] 0000002e1e568280 ffff880420450000 ffff88041f887a00 ffff880420450000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172218] ffffffff8192a870 0000000000000608 0000000000000000 ffffffff81928b00
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172224] Call Trace:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172233] [<ffffffff8106b766>] do_raw_spin_lock+0x5/0x8
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172240] [<ffffffffa05e3a1e>] packet_rcv+0x254/0x2ab [af_packet]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172257] [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172262] [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172266] [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172277] [<ffffffffa031adee>] igb_clean_rx_irq+0x30d/0x39e [igb]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172298] [<ffffffffa031aecd>] igb_poll+0x4e/0x74 [igb]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172313] [<ffffffff81339c88>] net_rx_action+0x65/0x178
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172319] [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172324] [<ffffffff813f9aac>] call_softirq+0x1c/0x30
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172329] [<ffffffff81003931>] do_softirq+0x3c/0x7b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172333] [<ffffffff81045f98>] irq_exit+0x3c/0xac
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172337] [<ffffffff81003655>] do_IRQ+0x82/0x98
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172342] [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e

2 Answers:

Answer 0 (score: 0)

From your description it looks like the problem is somewhere in the kernel rather than in the user process. The kernel stack trace dumps you are getting point in that direction. It just so happens that the user process was the one running on that CPU at that particular time.

A soft lockup is reported when a kernel execution thread hogs the processor for a long time. In most cases this is a sign of a problem in kernel code, for example in a particular device driver used in your installation. It looks like you may have run into a deadlock of some kind. Without seeing the code and the stack traces of the lockups it is not possible to pinpoint the problem.

Answer 1 (score: 0)

I got another dump; here is what I observed from the dumped trace:

kernel: [115455.404446]  [<ffffffff8106b766>] do_raw_spin_lock+0x5/0x8
kernel: [115455.404454]  [<ffffffffa05e3a1e>] packet_rcv+0x254/0x2ab [af_packet]
kernel: [115455.404477]  [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
kernel: [115455.404482]  [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
kernel: [115455.404487]  [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
kernel: [115455.404497]  [<ffffffffa031adee>] igb_clean_rx_irq+0x30d/0x39e [igb]
kernel: [115455.404517]  [<ffffffffa031aecd>] igb_poll+0x4e/0x74 [igb]
kernel: [115455.404532]  [<ffffffff81339c88>] net_rx_action+0x65/0x178
kernel: [115455.404538]  [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
kernel: [115455.404544]  [<ffffffff813f9aac>] call_softirq+0x1c/0x30
kernel: [115455.404550]  [<ffffffff81003931>] do_softirq+0x3c/0x7b
kernel: [115455.404555]  [<ffffffff81045f98>] irq_exit+0x3c/0xac
kernel: [115455.404558]  [<ffffffff81003655>] do_IRQ+0x82/0x98
kernel: [115455.404565]  [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e
kernel: [115455.404573]  [<ffffffffa05e0003>] atomic_inc+0x3/0x4 [af_packet]
kernel: [115455.404579]  [<ffffffffa05e3a33>] packet_rcv+0x269/0x2ab [af_packet]
kernel: [115455.404589]  [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
kernel: [115455.404593]  [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
kernel: [115455.404610]  [<ffffffffa041bd4b>] kni_net_rx_normal+0x12d/0x178 [rte_kni]
kernel: [115455.404690]  [<ffffffffa041ae58>] kni_thread+0x39/0x91 [rte_kni]
kernel: [115455.404758]  [<ffffffff8105975a>] kthread+0x76/0x7e
kernel: [115455.404763]  [<ffffffff813f99b4>] kernel_thread_helper+0x4/0x10

rte_kni runs on a kthread, i.e. in a process (user-space-like) context. netif_receive_skb() is called by kni_net_rx_normal() as a normal function, although it is usually called from soft-irq context. Now a softirq for the same socket arrives on the same core, and we end up in a deadlock, because softirqs were not disabled on that core when rte_kni called into the kernel receive function.
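
The pattern can be illustrated with a small sketch. This is paraphrased pseudo-kernel code of the situation, not the real af_packet/rte_kni source; rcv_handler() and rxq_lock are stand-ins for packet_rcv() and the socket receive-queue lock it takes with spin_lock():

/*
 * Illustration only: why delivering packets from a kthread via plain
 * netif_receive_skb() can self-deadlock on a lock that is normally only
 * taken from softirq context.
 */
static DEFINE_SPINLOCK(rxq_lock);

static void rcv_handler(struct sk_buff *skb)
{
        /* Plain spin_lock(), not spin_lock_bh(): fine as long as every caller
         * runs in softirq context, because softirqs never preempt each other
         * on the same CPU. */
        spin_lock(&rxq_lock);          /* (1) kni kthread takes the lock        */
        /* ... a hardware interrupt fires here; on irq_exit() the pending      */
        /* NET_RX softirq runs on the same CPU and delivers another skb to     */
        /* the same handler:                                                   */
        /*        spin_lock(&rxq_lock);   (2) spins forever -> soft lockup     */
        spin_unlock(&rxq_lock);
}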

So, to avoid the race between the softirq and netif_receive_skb() here, softirqs have to be kept out: either replace netif_receive_skb() with netif_rx(), or wrap the call to netif_receive_skb() with local_bh_disable()/local_bh_enable().
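
A minimal sketch of the two options inside the rte_kni receive loop (the surrounding skb setup in kni_net_rx_normal() is assumed; only the delivery call is the point here):

/* For each skb received from the KNI FIFO: */

/* Option 1: queue the skb to the per-CPU backlog instead of delivering it
 * inline. netif_rx() just raises NET_RX_SOFTIRQ, so packet_rcv() and the rest
 * of the receive path only ever run in softirq context and their locking
 * assumptions hold again. */
netif_rx(skb);

/* Option 2: keep the inline delivery, but block softirq processing on this
 * CPU around the call, so the receive path cannot be re-entered while the
 * socket queue lock is held. */
local_bh_disable();
netif_receive_skb(skb);
local_bh_enable();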