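I have an 8-GPU GCE instance that randomly restarts in the middle of a training routine. This has happened several times. The instance also seems to hang for a long time before it restarts. I found some traces in the dumped kernel logs that look like they might be the cause (?). Any ideas about what I can do?
The setup is fairly vanilla: an Ubuntu instance running a Python 3 TensorFlow application that trains on images, with the CUDA toolkit and NVIDIA drivers installed.
The logs are shown below. The last few lines show the system booting, but they appear almost 10 hours after the preceding entries.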
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749736] Call Trace:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749737] <IRQ> [<ffffffff813f8dd3>] dump_stack+0x63/0x90
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749746] [<ffffffff810ddd33>] __report_bad_irq+0x33/0xc0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749747] [<ffffffff810de0c7>] note_interrupt+0x247/0x290
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749749] [<ffffffff810db277>] handle_irq_event_percpu+0x167/0x1d0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749750] [<ffffffff810db31e>] handle_irq_event+0x3e/0x60
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749752] [<ffffffff810de639>] handle_fasteoi_irq+0x99/0x150
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749756] [<ffffffff8103119d>] handle_irq+0x1d/0x30
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749758] [<ffffffff8184341b>] do_IRQ+0x4b/0xd0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749761] [<ffffffff81841502>] common_interrupt+0x82/0x82
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749764] [<ffffffff81085d5e>] ? __do_softirq+0x7e/0x290
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749766] [<ffffffff810860e3>] irq_exit+0xa3/0xb0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749767] [<ffffffff818434e2>] smp_apic_timer_interrupt+0x42/0x50
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749769] [<ffffffff818417a2>] apic_timer_interrupt+0x82/0x90
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749770] <EOI> [<ffffffff81064606>] ? native_safe_halt+0x6/0x10
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749775] [<ffffffff81038e1e>] default_idle+0x1e/0xe0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749776] [<ffffffff8103962f>] arch_cpu_idle+0xf/0x20
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749780] [<ffffffff810c454a>] default_idle_call+0x2a/0x40
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749781] [<ffffffff810c48b1>] cpu_startup_entry+0x2f1/0x350
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749798] [<ffffffff810517c4>] start_secondary+0x154/0x190
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749799] handlers:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.752277] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.762984] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.773705] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.784444] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpu
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Linux version 4.4.0-79-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #100-Ubuntu SMP Wed May 17 19:58:14 UTC 2017 (Ubuntu 4.4.0-79.100-generic 4.4.67)
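
To put a number on the hang, the gap between the last entry before the reboot and the first boot entry can be measured from the log itself. The following is only a minimal sketch: the helper names (`parse_line`, `find_reboot_gaps`) are made up for illustration, the year is an assumption because syslog timestamps omit it, and a reboot is detected simply by the kernel uptime counter in brackets going backwards.

```python
import re
from datetime import datetime

# Matches the syslog prefix of a kernel line, e.g.
# "Jun 7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10"
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]"
)


def parse_line(line, year=2017):
    """Return (wall_clock, kernel_uptime_seconds), or None if the line does not match.

    The year is an assumption: syslog timestamps do not carry one.
    Month names are parsed with %b, which assumes an English/C locale.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    month, day, clock, uptime = m.groups()
    wall = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return wall, float(uptime)


def find_reboot_gaps(lines, year=2017):
    """Return (last_seen, first_boot, gap) for every point where uptime resets."""
    gaps = []
    prev = None
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue  # wrapped or non-kernel lines are skipped
        if prev is not None and parsed[1] < prev[1]:
            # Uptime went backwards, so the machine rebooted in between.
            gaps.append((prev[0], parsed[0], parsed[0] - prev[0]))
        prev = parsed
    return gaps


sample = [
    "Jun 7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10",
    "Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset",
]
gaps = find_reboot_gaps(sample)
for last_seen, first_boot, gap in gaps:
    print(f"last entry {last_seen}, next boot {first_boot}, gap {gap}")
```

Run against the two boundary lines from the log above, this reports the silent period between `Disabling IRQ #10` and the next boot. It only measures how long the log was silent; it does not show that the IRQ message caused the hang.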