GCE 8-GPU instance randomly reboots during training runs

Time: 2017-06-09 22:10:04

Tags: tensorflow google-cloud-platform google-compute-engine

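I have an 8-GPU GCE instance that randomly reboots in the middle of training runs. This has happened several times. The instance also seems to hang for a long time before it reboots. I found some traces in the dumped kernel log that look like they might be the cause (?). Any ideas what I can do about this?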

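The configuration is fairly vanilla: an Ubuntu instance running a Python 3 TensorFlow application that trains on images, with the NVIDIA drivers installed via the CUDA toolkit.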

The log is shown below. The last few lines show the system booting, but they appear roughly 10 hours after the preceding entries.

Jun  7 19:23:59 gpu-8-2 kernel: [62064.749736] Call Trace:
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749737]  <IRQ>  [<ffffffff813f8dd3>] dump_stack+0x63/0x90
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749746]  [<ffffffff810ddd33>] __report_bad_irq+0x33/0xc0
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749747]  [<ffffffff810de0c7>] note_interrupt+0x247/0x290
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749749]  [<ffffffff810db277>] handle_irq_event_percpu+0x167/0x1d0
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749750]  [<ffffffff810db31e>] handle_irq_event+0x3e/0x60
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749752]  [<ffffffff810de639>] handle_fasteoi_irq+0x99/0x150
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749756]  [<ffffffff8103119d>] handle_irq+0x1d/0x30
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749758]  [<ffffffff8184341b>] do_IRQ+0x4b/0xd0
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749761]  [<ffffffff81841502>] common_interrupt+0x82/0x82
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749764]  [<ffffffff81085d5e>] ? __do_softirq+0x7e/0x290
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749766]  [<ffffffff810860e3>] irq_exit+0xa3/0xb0
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749767]  [<ffffffff818434e2>] smp_apic_timer_interrupt+0x42/0x50
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749769]  [<ffffffff818417a2>] apic_timer_interrupt+0x82/0x90
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749770]  <EOI>  [<ffffffff81064606>] ? native_safe_halt+0x6/0x10
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749775]  [<ffffffff81038e1e>] default_idle+0x1e/0xe0
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749776]  [<ffffffff8103962f>] arch_cpu_idle+0xf/0x20
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749780]  [<ffffffff810c454a>] default_idle_call+0x2a/0x40
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749781]  [<ffffffff810c48b1>] cpu_startup_entry+0x2f1/0x350
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749798]  [<ffffffff810517c4>] start_secondary+0x154/0x190
Jun  7 19:23:59 gpu-8-2 kernel: [62064.749799] handlers:
Jun  7 19:23:59 gpu-8-2 kernel: [62064.752277] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun  7 19:23:59 gpu-8-2 kernel: [62064.762984] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun  7 19:23:59 gpu-8-2 kernel: [62064.773705] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun  7 19:23:59 gpu-8-2 kernel: [62064.784444] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun  7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10
Jun  8 05:27:43 gpu-8-2 kernel: [    0.000000] Initializing cgroup subsys cpuset
Jun  8 05:27:43 gpu-8-2 kernel: [    0.000000] Initializing cgroup subsys cpu
Jun  8 05:27:43 gpu-8-2 kernel: [    0.000000] Initializing cgroup subsys cpuacct
Jun  8 05:27:43 gpu-8-2 kernel: [    0.000000] Linux version 4.4.0-79-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #100-Ubuntu SMP Wed May 17 19:58:14 UTC 2017 (Ubuntu 4.4.0-79.100-generic 4.4.67)
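
To pin down how long the instance was silent before it came back, the gap between the last entry before the restart and the first boot message can be computed from the log itself. The following is only a minimal sketch, assuming the syslog line format shown above; the function name `find_reboots` and its `year` parameter are hypothetical helpers introduced for illustration (syslog timestamps carry no year, so one has to be supplied). A restart is detected where the kernel uptime counter in square brackets drops.

```python
import re
from datetime import datetime

# Matches lines such as:
# "Jun  7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10"
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$"
)


def find_reboots(lines, year=2017):
    """Return (last_msg_before, first_msg_after, gap) for each kernel restart.

    `year` is an assumption supplied by the caller, because the syslog
    timestamp format does not include it.
    """
    reboots = []
    prev = None  # (wall-clock time, kernel uptime, message) of the previous line
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not kernel messages
        # Collapse the padding in "Jun  7" before parsing.
        stamp_text = " ".join(m["stamp"].split())
        stamp = datetime.strptime(f"{year} {stamp_text}", "%Y %b %d %H:%M:%S")
        uptime = float(m["uptime"])
        # The uptime counter only goes backwards when the kernel has restarted.
        if prev is not None and uptime < prev[1]:
            reboots.append((prev[2], m["msg"], stamp - prev[0]))
        prev = (stamp, uptime, m["msg"])
    return reboots


if __name__ == "__main__":
    sample = [
        "Jun  7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10",
        "Jun  8 05:27:43 gpu-8-2 kernel: [    0.000000] Initializing cgroup subsys cpuset",
    ]
    for before, after, gap in find_reboots(sample):
        print(f"last message: {before!r}, first boot message: {after!r}, gap: {gap}")
```

Run against the two boundary lines above, this reports a gap of 10:03:44 between "Disabling IRQ #10" and the first boot message.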

0 answers:

No answers