I am developing a RAID system, and I am chasing a BUG that shows up when I issue a large number of parallel IOs. I cannot figure out how to debug this dump. What is apparent here is that the RAID_LAZY_WRITE thread tried to schedule while the CPU was in atomic context, but I cannot trace exactly how and why that happens, because I do not hold any lock while issuing IO to the block layer.
I have enabled spinlock debugging and lock dependency checking so that I can get detailed debug information. The entire dump I got follows below; please give me hints and ideas on how to analyze and debug it.
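For reference, the debug options I mean are roughly these config symbols on my 2.6.24 kernel (treat the exact set as my recollection, not a verified .config excerpt):

    CONFIG_DEBUG_SPINLOCK=y        # spinlock sanity checks (magic, owner, recursion)
    CONFIG_DEBUG_SPINLOCK_SLEEP=y  # warn when sleeping inside an atomic section
    CONFIG_PROVE_LOCKING=y         # lockdep lock dependency checking
    CONFIG_DEBUG_LOCK_ALLOC=y      # selected automatically by PROVE_LOCKING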
BUG: scheduling while atomic: RAID_LAZY_WRITE/102/0x00000205
no locks held by RAID_LAZY_WRITE/102.
Showing all locks held in the system:
=============================================
Pid: 102, comm: RAID_LAZY_WRITE
CPU: 0 Not tainted (2.6.24 #274)
PC is at add_preempt_count+0xa0/0xbc
LR is at raid_write_stripe+0x388/0x544
pc : [<c004b014>] lr : [<c015a83c>] psr: 80000013
sp : cfc63ec4 ip : cfc63ed4 fp : cfc63ed0
r10: cf5ce118 r9 : 00000001 r8 : 000000b4
r7 : 00000001 r6 : cf403cd8 r5 : cfc62000 r4 : 00000002
r3 : 00000001 r2 : 00000001 r1 : cfc62000 r0 : 00000001
Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
Control: 0000397f Table: 0f558000 DAC: 00000017
[<c0039fa0>] (show_regs+0x0/0x4c) from [<c004ae14>] (__schedule_bug+0x54/0x68)
r4:cfc9b0a0
[<c004adc0>] (__schedule_bug+0x0/0x68) from [<c029747c>] (schedule+0x74/0x2f4)
r5:cfc9b0a0 r4:c034d13c
[<c0297408>] (schedule+0x0/0x2f4) from [<c029772c>] (io_schedule+0x30/0x54)
[<c02976fc>] (io_schedule+0x0/0x54) from [<c012e214>] (get_request_wait+0xac/0x140)
r4:00000000
[<c012e168>] (get_request_wait+0x0/0x140) from [<c012e864>] (__make_request+0x478/0x544)
[<c012e3ec>] (__make_request+0x0/0x544) from [<c012c3d0>] (generic_make_request+0x2a8/0x2f0)
[<c012c128>] (generic_make_request+0x0/0x2f0) from [<c0168858>] (raid_submit_disk_io+0x348/0x37c)
[<c0168510>] (raid_submit_disk_io+0x0/0x37c) from [<c016be24>] (perform_raid5_ops+0xdc8/0x1100)
[<c016b05c>] (perform_raid5_ops+0x0/0x1100) from [<c016c594>] (raid5_RCW_postXOR+0x438/0x4c4)
[<c016c15c>] (raid5_RCW_postXOR+0x0/0x4c4) from [<c016cb8c>] (raid5_xor_completed+0x2a4/0x3b0)
[<c016c8e8>] (raid5_xor_completed+0x0/0x3b0) from [<c01fb3a4>] (iop_adma_run_tx_complete_actions+0x64/0xe8)
[<c01fb340>] (iop_adma_run_tx_complete_actions+0x0/0xe8) from [<c01fb720>] (__iop_adma_slot_cleanup+0x2f8/0x3c0)
r8:c0352f24 r7:00000000 r6:00000000 r5:00000000 r4:cf4faa7c
[<c01fb428>] (__iop_adma_slot_cleanup+0x0/0x3c0) from [<c01fb7f8>] (iop_adma_tasklet+0x10/0x14)
[<c01fb7e8>] (iop_adma_tasklet+0x0/0x14) from [<c0057204>] (run_timer_softirq+0x17c/0x208)
[<c0057088>] (run_timer_softirq+0x0/0x208) from [<c0053268>] (__do_softirq+0x64/0xd0)
[<c0053204>] (__do_softirq+0x0/0xd0) from [<c005331c>] (irq_exit+0x48/0x5c)
[<c00532d4>] (irq_exit+0x0/0x5c) from [<c0038048>] (__exception_text_start+0x48/0x60)
[<c0038000>] (__exception_text_start+0x0/0x60) from [<c0038ab0>] (__irq_svc+0x50/0x7c)
Exception stack(0xcfc63e7c to 0xcfc63ec4)
3e60: 00000001
3e80: cfc62000 00000001 00000001 00000002 cfc62000 cf403cd8 00000001 000000b4
3ea0: 00000001 cf5ce118 cfc63ed0 cfc63ed4 cfc63ec4 c015a83c c004b014 80000013
3ec0: ffffffff
r6:00000020 r5:cfc63eb0 r4:ffffffff
[<c004af74>] (add_preempt_count+0x0/0xbc) from [<c015a83c>] (raid_write_stripe+0x388/0x544)
[<c015a4b4>] (raid_write_stripe+0x0/0x544) from [<c015fc44>] (raid_lazy_writer+0x4c4/0x6b8)
[<c015f780>] (raid_lazy_writer+0x0/0x6b8) from [<c0159528>] (raid_thread_daemon+0x140/0x15c)
[<c01593e8>] (raid_thread_daemon+0x0/0x15c) from [<c0062c78>] (kthread+0x58/0x90)
[<c0062c20>] (kthread+0x0/0x90) from [<c0051284>] (do_exit+0x0/0x764)
r6:00000000 r5:00000000 r4:00000000
BUG: spinlock cpu recursion on CPU#0, RAID_DAEMON/101
lock: cf4faae0, .magic: dead4ead, .owner: RAID_LAZY_WRITE/102, .owner_cpu: 0
[<c032a9d0>] (dump_stack+0x0/0x14) from [<c0141c90>] (spin_bug+0x90/0xa4)
[<c0141c00>] (spin_bug+0x0/0xa4) from [<c0141e08>] (_raw_spin_lock+0x68/0x15c)
r5:c01fb378 r4:cf4faae0
[<c0141da0>] (_raw_spin_lock+0x0/0x15c) from [<c0298ec8>] (_spin_lock_bh+0x4c/0x54)
[<c0298e7c>] (_spin_lock_bh+0x0/0x54) from [<c01fb378>] (iop_adma_run_tx_complete_actions+0x38/0xe8)
r5:00000000 r4:cf4faa7c
[<c01fb340>] (iop_adma_run_tx_complete_actions+0x0/0xe8) from [<c01fb720>] (__iop_adma_slot_cleanup+0x2f8/0x3c0)
r8:00000001 r7:00000000 r6:00000000 r5:00000000 r4:cf4faa7c
[<c01fb428>] (__iop_adma_slot_cleanup+0x0/0x3c0) from [<c01fb7f8>] (iop_adma_tasklet+0x10/0x14)
[<c01fb7e8>] (iop_adma_tasklet+0x0/0x14) from [<c0052e30>] (tasklet_action+0x8c/0xe4)
[<c0052da4>] (tasklet_action+0x0/0xe4) from [<c0053268>] (__do_softirq+0x64/0xd0)
r6:0000000a r5:00000001 r4:c0351f44
[<c0053204>] (__do_softirq+0x0/0xd0) from [<c005331c>] (irq_exit+0x48/0x5c)
[<c00532d4>] (irq_exit+0x0/0x5c) from [<c0038048>] (__exception_text_start+0x48/0x60)
[<c0038000>] (__exception_text_start+0x0/0x60) from [<c0038ab0>] (__irq_svc+0x50/0x7c)
Exception stack(0xcfcb1e64 to 0xcfcb1eac)
1e60: cf652404 00000001 ffffffff 00000001 cf652404 40000013 c0417bec
1e80: 7fffffff cea81d88 00000000 cf6523e0 cfcb1ec0 cfcb1eac cfcb1eac c0298d00
1ea0: c0298d04 60000013 ffffffff
r6:00000064 r5:cfcb1e98 r4:ffffffff
[<c0298cd4>] (_spin_unlock_irqrestore+0x0/0x54) from [<c015c344>] (raid_read_stripe+0x204/0x560)
r5:00000008 r4:cf652404
[<c015c140>] (raid_read_stripe+0x0/0x560) from [<c01602bc>] (raid_daemon+0x1d4/0x1110)
[<c01600e8>] (raid_daemon+0x0/0x1110) from [<c0159528>] (raid_thread_daemon+0x140/0x15c)
[<c01593e8>] (raid_thread_daemon+0x0/0x15c) from [<c0062c78>] (kthread+0x58/0x90)
[<c0062c20>] (kthread+0x0/0x90) from [<c0051284>] (do_exit+0x0/0x764)
r6:00000000 r5:00000000 r4:00000000
Thanks.
Regards, Vikash Kumar http://VikashKumar.com/
Answer 0 (score: 3)
I solved my problem by calling "kmalloc(sizeof(struct mystruct), GFP_ATOMIC)" instead of "kmalloc(sizeof(struct mystruct), GFP_KERNEL)". Here is a good article about atomicity: http://kerneltrap.org/node/435
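A minimal sketch of the change, assuming a hypothetical struct mystruct and an allocation made from a timer callback (i.e. atomic context); the names here are illustrative, not from my real code:

    #include <linux/slab.h>
    #include <linux/timer.h>

    struct mystruct {
        int data;
    };

    /* Timer callbacks run in softirq context and must not sleep.
     * GFP_KERNEL may block waiting for memory; GFP_ATOMIC never
     * sleeps, but it can fail, so the NULL check is mandatory. */
    static void my_timer_callback(unsigned long arg)
    {
        struct mystruct *m = kmalloc(sizeof(struct mystruct), GFP_ATOMIC);

        if (!m)
            return; /* allocation failed: drop the work or retry later */
        m->data = 42;
        /* ... use m; kfree(m) when done ... */
    }

The price of GFP_ATOMIC is that it dips into emergency reserves and fails more readily under pressure, so every call site needs a failure path.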
I am sorry if my comment does not help you, Vikash, but I am trying to help because the "scheduling while atomic" bug hit me hard as well.
Regards, babanjida
Answer 1 (score: 2)
The problem seems to be that the block layer API (generic_make_request) is called from softirq context, as the callback of another IO or XOR operation. Apparently the callback is invoked in timer context, as can be seen from the call stack; that is interrupt context, we cannot sleep in interrupt context, and hence the bug.
However, this needs to be verified; I will update this thread as soon as possible, and until then I welcome your comments on it.
I have verified this now: calling the block IO API (generic_make_request) from a softirq (the read callback) is a bad idea. It leads to the scheduling-while-atomic BUG. I have offloaded that task to another thread, and it works perfectly.
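Roughly what the offload looks like, as a sketch only: the names raid_io_work, raid_submit_work and raid_defer_bio are hypothetical stand-ins, not my actual driver code. The idea is to hand the bio from the XOR-completion softirq over to process context, where blocking inside generic_make_request()/get_request_wait() is legal:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    /* Deferred-submission context carried from softirq to process context. */
    struct raid_io_work {
        struct work_struct work;
        struct bio *bio;
    };

    /* Runs in process context, so it may sleep in the block layer. */
    static void raid_submit_work(struct work_struct *work)
    {
        struct raid_io_work *w = container_of(work, struct raid_io_work, work);

        generic_make_request(w->bio);
        kfree(w);
    }

    /* Called from the XOR-completion callback (softirq context):
     * queue the bio instead of submitting it directly. */
    static void raid_defer_bio(struct bio *bio)
    {
        struct raid_io_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

        if (!w) {
            bio_io_error(bio); /* out of memory: fail this IO */
            return;
        }
        w->bio = bio;
        INIT_WORK(&w->work, raid_submit_work);
        schedule_work(&w->work);
    }

In my case I actually used my own dedicated kernel thread rather than schedule_work(); the workqueue above is just the simplest way to reach process context.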
Thanks.
Regards, Vikash
Answer 2 (score: 1)
My problem is different from yours, but I was also getting "BUG: scheduling while atomic". I spent two days debugging my code, and finally I found (not 100% sure, but almost certain) that the problem appeared when I called "kmalloc(sizeof(struct mystruct), GFP_KERNEL)" inside a scheduled function. I decided to post this comment because it was not obvious to me that removing the kmalloc was the thing to try.
Now I have to think of another way to allocate memory inside a scheduled function without triggering the bug again!
I welcome your comments too, and please point me to an allocation function that will not upset the kernel!
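One approach I am considering (a rough, untested sketch; struct mystruct is my hypothetical struct, and the pool size of 16 is arbitrary) is to reserve objects with a mempool at init time, in process context, and draw from that reserve with GFP_ATOMIC in the scheduled path:

    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/module.h>
    #include <linux/slab.h>

    struct mystruct {
        int data;
    };

    static struct kmem_cache *mystruct_cache;
    static mempool_t *mystruct_pool;

    /* Runs once in process context: pre-reserve 16 objects. */
    static int __init mystruct_pool_init(void)
    {
        mystruct_cache = kmem_cache_create("mystruct",
                                           sizeof(struct mystruct),
                                           0, 0, NULL);
        if (!mystruct_cache)
            return -ENOMEM;

        mystruct_pool = mempool_create_slab_pool(16, mystruct_cache);
        if (!mystruct_pool) {
            kmem_cache_destroy(mystruct_cache);
            return -ENOMEM;
        }
        return 0;
    }
    module_init(mystruct_pool_init);

    /* Safe in atomic/scheduled context: with GFP_ATOMIC,
     * mempool_alloc() falls back to the pre-reserved objects
     * instead of sleeping, though it can still return NULL if
     * the reserve is exhausted. Free with mempool_free(). */
    static struct mystruct *mystruct_get_atomic(void)
    {
        return mempool_alloc(mystruct_pool, GFP_ATOMIC);
    }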
Ahhhhhhh! The kernel is hard to play with!!!