I am using graph replication with HDFS, but my program hangs when the log directory argument of MonitoredTrainingSession (checkpoint_dir) is set to hdfs://default/mypath4train_logs. The code is as follows:
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(worker_index==0),
                                       checkpoint_dir="hdfs://default/mypath4train_logs/train_logs",
                                       config=sess_config) as sess:
I have read the TensorFlow guide "How to run TensorFlow on Hadoop" and done what it says. Most importantly, the paths to libjvm.so and libhdfs.so have been appended to LD_LIBRARY_PATH.
However, the chief worker's log stops at:
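For reference, the setup can be double-checked with a small script. This is only a minimal sketch: the helper name check_hdfs_env is made up for illustration, and the list of variables (JAVA_HOME, HADOOP_HDFS_HOME, CLASSPATH) is my reading of what that guide asks for in addition to LD_LIBRARY_PATH, so adjust it if your setup differs.

```python
import os

# Shared libraries that must be findable through LD_LIBRARY_PATH.
REQUIRED_LIBS = ("libjvm.so", "libhdfs.so")
# Assumption: variables the Hadoop guide expects besides LD_LIBRARY_PATH.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HDFS_HOME", "CLASSPATH")


def check_hdfs_env(env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []

    # Every required variable must be present and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append("%s is not set" % name)

    # Each library must exist as a file in at least one LD_LIBRARY_PATH directory.
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    for lib in REQUIRED_LIBS:
        if not any(os.path.isfile(os.path.join(d, lib)) for d in dirs):
            problems.append("%s not found on LD_LIBRARY_PATH" % lib)

    return problems


if __name__ == "__main__":
    for problem in check_hdfs_env():
        print(problem)
```

An empty result only says that the variables and files are in place in the process that runs the check. It does not prove that the JVM can actually reach the NameNode.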
INFO:tensorflow:Create CheckpointSaverHook.
2017-10-09 23:34:55,575 Create CheckpointSaverHook.
I debugged it, and the program appears to hang in tensorflow::FileSystem::RecursivelyCreateDir without reporting any error. This is how I debugged it:
(gdb) next
Single stepping until exit from function _ZNSsC1ERKSs@plt,
which has no line number information.
0x00007fc181431260 in std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string(std::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&) ()
from /usr/lib64/libstdc++.so.6
(gdb) next
Single stepping until exit from function _ZNSsC2ERKSs,
which has no line number information.
0x00007fc1a1aeddcd in
tensorflow::HadoopFileSystem::Connect(tensorflow::StringPiece, hdfs_internal**) ()
from /search/odin/tensorflow/py34tf1.3env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) next
Single stepping until exit from function _ZN10tensorflow16HadoopFileSystem7ConnectENS_11StringPieceEPP13hdfs_internal,
which has no line number information.
0x00007fc1a1aef752 in tensorflow::HadoopFileSystem::FileExists(std::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&) ()
from /search/odin/tensorflow/py34tf1.3env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) next
Single stepping until exit from function _ZN10tensorflow16HadoopFileSystem10FileExistsERKSs,
which has no line number information.
0x00007fc1a1d31ffa in
tensorflow::FileSystem::RecursivelyCreateDir(std::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&) ()
from /search/odin/tensorflow/py34tf1.3env/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) s
Single stepping until exit from function _ZN10tensorflow10FileSystem20RecursivelyCreateDirERKSs,
which has no line number information.
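Note that the last gdb command never returned.
I also found the following similar question, but it is either still unresolved or did not help with my problem: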
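To check whether the hang is specific to MonitoredTrainingSession, the directory creation can be tried on its own in a child process with a timeout. The following is a minimal sketch under some assumptions: run_probe and probe_hdfs_mkdir are hypothetical helper names, the 60-second timeout is arbitrary, and I am assuming that tf.gfile.MakeDirs (TensorFlow 1.x) ends up in the same RecursivelyCreateDir call seen in the gdb trace.

```python
import subprocess
import sys


def run_probe(snippet, timeout_s):
    """Run a Python snippet in a child process and classify the outcome.

    Returns "ok" if the child exits with status 0, "error" if it exits with
    a non-zero status, and "hang" if it is still running after timeout_s
    seconds (subprocess.run kills the child in that case).
    """
    try:
        proc = subprocess.run([sys.executable, "-c", snippet], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "error"


def probe_hdfs_mkdir(path, timeout_s=60):
    """Try to create `path` through TensorFlow's file API in isolation."""
    # Assumption: tf.gfile.MakeDirs exercises the same HDFS code path as the
    # checkpoint saver when it creates checkpoint_dir.
    snippet = "import tensorflow as tf; tf.gfile.MakeDirs(%r)" % path
    return run_probe(snippet, timeout_s)


# Example usage (needs TensorFlow and the Hadoop environment from above):
#   print(probe_hdfs_mkdir("hdfs://default/mypath4train_logs/train_logs"))
```

If this also reports "hang", the problem would seem to lie in the HDFS connection itself rather than in the session or the hooks.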
Distributed Tensorflow 1.0 Supervisor stuck if logdir is in HDFS