Distributed TensorFlow hangs when HDFS is used as the log directory for MonitoredTrainingSession

Asked: 2017-10-09 16:21:09

Tags: hadoop tensorflow hdfs distributed

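I am using graph replication with HDFS, but my program hangs when the log directory argument of MonitoredTrainingSession (checkpoint_dir in the code below) is set to a path under hdfs://default/mypath4train_logs. The code is as follows: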

with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(worker_index==0),
                                           checkpoint_dir="hdfs://default/mypath4train_logs/train_logs",
                                           config=sess_config) as sess:

I have read the TensorFlow guide "How to run TensorFlow on Hadoop" and followed its instructions. Most importantly, the paths to libjvm.so and libhdfs.so have been appended to LD_LIBRARY_PATH.
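The prerequisites mentioned above can be sanity-checked before starting a worker. The following is only a minimal sketch, not something taken from the guide: the helper name check_hdfs_env is hypothetical, and the list of variables (JAVA_HOME, HADOOP_HDFS_HOME, LD_LIBRARY_PATH, CLASSPATH) is an assumption based on what the guide asks for. It only verifies that the variables are set and that the two shared libraries can be found in a directory on LD_LIBRARY_PATH; it does not contact the namenode, and libhdfs.so may legitimately live in a standard system location instead.

```python
import os

# Assumed prerequisites for TensorFlow's HDFS support; adjust to your setup.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HDFS_HOME", "LD_LIBRARY_PATH", "CLASSPATH")
REQUIRED_LIBS = ("libjvm.so", "libhdfs.so")


def check_hdfs_env(env=None):
    """Return a list of problems found; an empty list means nothing obvious is missing."""
    env = os.environ if env is None else env
    problems = []

    # Every variable must be present and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append("%s is not set" % name)

    # Each shared library must exist in at least one LD_LIBRARY_PATH directory.
    lib_dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    for lib in REQUIRED_LIBS:
        if not any(os.path.isfile(os.path.join(d, lib)) for d in lib_dirs):
            problems.append("%s not found on LD_LIBRARY_PATH" % lib)

    return problems


if __name__ == "__main__":
    for problem in check_hdfs_env():
        print("WARNING:", problem)
```

Running the script on each worker host prints one warning per missing item. An empty output does not prove that HDFS access works, only that the basic environment is in place.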

However, the chief worker's log stops at:

INFO:tensorflow:Create CheckpointSaverHook.
2017-10-09 23:34:55,575 Create CheckpointSaverHook.

I debugged it, and the program appears to be stuck in tensorflow::FileSystem::RecursivelyCreateDir without reporting any error. This is the gdb session:

(gdb) next
Single stepping until exit from function _ZNSsC1ERKSs@plt,
which has no line number information.
0x00007fc181431260 in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(std::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&) ()
from /usr/lib64/libstdc++.so.6
(gdb) next
Single stepping until exit from function _ZNSsC2ERKSs,
which has no line number information.
0x00007fc1a1aeddcd in 
tensorflow::HadoopFileSystem::Connect(tensorflow::StringPiece, hdfs_internal**) ()
from /search/odin/tensorflow/py34tf1.3env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) next
Single stepping until exit from function _ZN10tensorflow16HadoopFileSystem7ConnectENS_11StringPieceEPP13hdfs_internal,
which has no line number information.
0x00007fc1a1aef752 in tensorflow::HadoopFileSystem::FileExists(std::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&) ()
from /search/odin/tensorflow/py34tf1.3env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) next
Single stepping until exit from function _ZN10tensorflow16HadoopFileSystem10FileExistsERKSs,
which has no line number information.
0x00007fc1a1d31ffa in 
tensorflow::FileSystem::RecursivelyCreateDir(std::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&) ()
from /search/odin/tensorflow/py34tf1.3env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) s
Single stepping until exit from function _ZN10tensorflow10FileSystem20RecursivelyCreateDirERKSs,
which has no line number information.

Note that the last gdb command never returned.

I also found these similar questions, which are either unanswered or did not help with my problem:

Distributed Tensorflow 1.0 Supervisor stuck if logdir is in HDFS

How to use hdfs directory path in tf.train.MonitoredTrainingSession API for writing logs and checkpoints

0 answers:

No answers yet