Why does TensorFlow just output "Killed"?

Time: 2017-08-17 13:02:25

Tags: tensorflow

When I run my TensorFlow application, it just outputs "Killed". How can I debug this?

source code

root@8e4a3a65184e:~/tensorflow# python sample_cnn.py 
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_model_dir': 'data/convnet_model', '_save_summary_steps': 100}
INFO:tensorflow:Create CheckpointSaverHook.
2017-08-17 12:56:53.160481: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160536: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-17 12:56:53.160555: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Killed

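3 answers:

Answer 0 (score: 11)

I get the same behavior when I run your code. If you run dmesg afterwards, you will see a trace like the one below, which confirms what gdelab hinted at: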

[38607.234089] python3 invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[38607.234090] python3 cpuset=/ mems_allowed=0
[38607.234094] CPU: 3 PID: 1420 Comm: python3 Tainted: G           O    4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u2
[38607.234094] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.2.4 03/29/2017
[38607.234096]  0000000000000000 ffffffffa9f28414 ffffa50090317cf8 ffff940effa5f040
[38607.234097]  ffffffffa9dfe050 0000000000000000 0000000000000000 0101ffffa9d82dd0
[38607.234098]  e09c7db7f06d0ac2 00000000ffffffff 0000000000000000 0000000000000000
[38607.234100] Call Trace:
[38607.234104]  [<ffffffffa9f28414>] ? dump_stack+0x5c/0x78
[38607.234106]  [<ffffffffa9dfe050>] ? dump_header+0x78/0x1fd
[38607.234108]  [<ffffffffa9d8047a>] ? oom_kill_process+0x21a/0x3e0
[38607.234109]  [<ffffffffa9d800fd>] ? oom_badness+0xed/0x170
[38607.234110]  [<ffffffffa9d80911>] ? out_of_memory+0x111/0x470
[38607.234111]  [<ffffffffa9d85b4f>] ? __alloc_pages_slowpath+0xb7f/0xbc0
[38607.234112]  [<ffffffffa9d85d8e>] ? __alloc_pages_nodemask+0x1fe/0x260
[38607.234113]  [<ffffffffa9dd7c3e>] ? alloc_pages_vma+0xae/0x260
[38607.234115]  [<ffffffffa9db39ba>] ? handle_mm_fault+0x111a/0x1350
[38607.234117]  [<ffffffffa9c5fd84>] ? __do_page_fault+0x2a4/0x510
[38607.234118]  [<ffffffffaa207658>] ? page_fault+0x28/0x30
...
[38607.234158] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
...
[38607.234332] [ 1396]  1000  1396  4810969  3464995    6959      21        0             0 python3
[38607.234332] Out of memory: Kill process 1396 (python3) score 568 or sacrifice child
[38607.234357] Killed process 1396 (python3) total-vm:19243876kB, anon-rss:13859980kB, file-rss:0kB, shmem-rss:0kB
[38607.720757] oom_reaper: reaped process 1396 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

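This basically means that Python started consuming too much memory and the kernel decided to kill the process. If you add some print statements to your code, you will see that mnist_classifier.train() is the call that is running when this happens. However, a few quick tests (removing the logging and lowering the number of steps) did not seem to help.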
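One way to go beyond plain print statements is to log the process's peak memory at each checkpoint. The following is a minimal sketch using only the standard library `resource` module (Unix only). The helper names `peak_rss_mib` and `log_memory` are hypothetical, and the unit conversion assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_mib():
    """Return this process's peak resident set size in MiB.

    Assumption: Linux, where ru_maxrss is reported in kilobytes
    (macOS reports bytes, so the conversion would differ there).
    """
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kib / 1024.0


def log_memory(label):
    """Print a labelled peak-memory reading (hypothetical helper)."""
    print("[mem] %s: peak RSS = %.1f MiB" % (label, peak_rss_mib()))


# Example: call this before and after each expensive step.
log_memory("startup")
```

Because the value is a high-water mark, it only ever grows. A large jump between two readings points at the step in between. The last reading printed before "Killed" shows how far the process got.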

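Answer 1 (score: 5)

As the other commenters said, your OS is killing your process because it runs out of memory. You are trying to build an enormous network. Look at your last dense layer: it has 65536 inputs and 65536 units. Each unit has one weight per input, which makes 65536 * 65536 = 4294967296 weights. The weight type follows your input dtype, which I think is float64 in your case, so multiply by 64 bits and you get 32GB of weights (65536 * 65536 * 64/1024/1024/1024/8 = 32). All of these weights live in a single tensor that has to be operated on as a whole, so it must fit entirely in RAM. Does your system have 32GB of RAM?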
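The arithmetic above can be checked with a few lines of Python. This is only a sketch: `dense_weight_bytes` is a hypothetical helper, it counts the weight matrix alone (no biases, gradients, or optimizer state), and float64 is the answer's assumption about the dtype, shown next to float32 for comparison.

```python
def dense_weight_bytes(n_inputs, n_units, bytes_per_weight):
    """Size in bytes of a dense layer's weight matrix (biases not counted)."""
    return n_inputs * n_units * bytes_per_weight


GIB = 1024 ** 3

# 65536 inputs and 65536 units, as in the layer discussed above.
for name, size in (("float64", 8), ("float32", 4)):
    total = dense_weight_bytes(65536, 65536, size)
    print("%s: %d weights, %.0f GiB" % (name, 65536 * 65536, total / GIB))
# float64: 4294967296 weights, 32 GiB
# float32: 4294967296 weights, 16 GiB
```

Even in float32 the weight matrix alone needs 16 GiB, so shrinking the layer is likely to matter more than changing the dtype.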

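Answer 2 (score: 4)

Your program is being killed by your OS. TensorFlow does not know why, which is why it does not output anything. This is probably caused by an out-of-memory condition.

Check whether your syslog contains a line like this: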

<date> <computer> kernel: [...] Out of memory: Kill process <id> (python) score <...> or sacrifice child

If so, you need to increase the memory available to Python and/or reduce the memory your program uses.
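The syslog check can also be scripted. The sketch below searches log text for the "Out of memory" line shown above. `find_oom_kills` and the pattern are illustrative assumptions, not part of any tool. The pattern also accepts the "Killed process" wording that some kernels use, and the sample line is copied from the dmesg output quoted in answer 0.

```python
import re

# Matches "Out of memory: Kill process <pid> (<name>)" and also the
# "Out of memory: Killed process <pid> (<name>)" wording.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for OOM-killer lines in log_text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(log_text)]


sample = ("[38607.234332] Out of memory: Kill process 1396 (python3) "
          "score 568 or sacrifice child")
print(find_oom_kills(sample))  # [(1396, 'python3')]
```

To use it on a real machine, pass in the contents of the system log or the captured output of dmesg. The log file location varies by distribution.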