Question

我创建了一个4节点tensorflow集群，2个worker，2 ps。当worker或ps出现故障时，我会在机器上使用相同的配置重新启动它。但是，它无法从检查点继续工作。分布式tensorflow是否仍然不支持故障转移？

此文件＆＃34; model.ckpt-24＆＃34;仅在机器上（工人的任务0），跟踪如下。

我创建了一个4节点tensorflow集群，2个worker，2 ps。当worker或ps出现故障时，我会在机器上使用相同的配置重新启动它。但是，它无法从检查点继续工作。分布式tensorflow是否仍然不支持故障转移？

此文件＆＃34; model.ckpt-24＆＃34;仅在机器上（工人的任务0），跟踪如下。

Traceback (most recent call last):
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 107, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 87, in main
with sv.managed_session(server.target) as sess:
File "/usr/lib64/python2.7/contextlib.py", line 17, in enter
return self.gen.next()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 969, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 797, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 958, in managed_session
start_standard_services=start_standard_services)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 715, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 227, in prepare_session
config=config)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 173, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1345, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/model.ckpt-24: Not found: /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003
[[Node: save/restore_slice_1 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:ps/replica:0/task:1/cpu:0"](_recv_save/Const_0_S3, save/restore_slice_1/tensor_name, save/restore_slice_1/shape_and_slice)]]

Caused by op u'save/restore_slice_1', defined at:
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 107, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/dump/9/nm-local-dir/usercache/danrtsey.wy/appcache/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/app/install/trainer.py", line 71, in main
saver = tf.train.Saver()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 986, in init
self.build()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1015, in build
restore_sequentially=self._restore_sequentially)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 620, in build
restore_sequentially, reshape)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 357, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 270, in restore_op
preferred_shard=preferred_shard))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 204, in _restore_slice
preferred_shard, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 359, in _restore_slice
preferred_shard=preferred_shard, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003/model.ckpt-24: Not found: /dump/6/nm-logs/application_1477899492621_0004/container_e07_1477899492621_0004_01_000003
[[Node: save/restore_slice_1 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:ps/replica:0/task:1/cpu:0"](_recv_save/Const_0_S3, save/restore_slice_1/tensor_name, save/restore_slice_1/shape_and_slice)]]

Answer 1

Tensorflow可以从检查点恢复。只需使用HDFS保存检查点就可以解决这个问题。

如何进行分布式tensorFlow支持故障转移？

1 个答案: