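I am trying to run some example code based on this tf official tutorial.
I also watched this video, which is really good.
As described in the video above, the chief worker is responsible for saving checkpoints, and this is implemented by tf.train.MonitoredTrainingSession.
So I assumed that only the chief worker needs a directory to save checkpoints to.
When I run the code with ps0 on machine1 and worker0 on machine2, everything seems to work.
However, when I run ps0 and worker0 on machine1, and ps1 and worker1 on machine2, an error occurs. The error in the worker0 log looks like this: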
Traceback (most recent call last):
File "distributed_train.py", line 136, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "distributed_train.py", line 97, in main
hooks=hooks) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 712, in create_session
hook.after_create_session(self.tf_sess, self.coord)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 450, in after_create_session
self._save(session, global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 481, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1669, in save
raise exc
tensorflow.python.framework.errors_impl.NotFoundError: ./train_dir/dist_worker_0/model.ckpt-0_temp_cf2b45f059b74507a65cae9b7a9ea5b4; No such file or directory
[[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](save/ShardedFilename_1, save/SaveV2_1/tensor_names, save/SaveV2_1/shape_and_slices, conv1/biases, conv1/biases/Adagrad, conv2/biases, conv2/biases/Adagrad, local3/biases, local3/biases/Adagrad, local4/biases, local4/biases/Adagrad, softmax_linear/biases, softmax_linear/biases/Adagrad)]]
Caused by op u'save/SaveV2_1', defined at:
File "distributed_train.py", line 136, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "distributed_train.py", line 97, in main
hooks=hooks) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 706, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 468, in create_session
self._scaffold.finalize()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 856, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1284, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1296, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1333, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 772, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 363, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 337, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 278, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 194, in save_op
tensors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1687, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
NotFoundError (see above for traceback): ./train_dir/dist_worker_0/model.ckpt-0_temp_cf2b45f059b74507a65cae9b7a9ea5b4; No such file or directory
[[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](save/ShardedFilename_1, save/SaveV2_1/tensor_names, save/SaveV2_1/shape_and_slices, conv1/biases, conv1/biases/Adagrad, conv2/biases, conv2/biases/Adagrad, local3/biases, local3/biases/Adagrad, local4/biases, local4/biases/Adagrad, softmax_linear/biases, softmax_linear/biases/Adagrad)]]
However, the directory ./train_dir/dist_worker_0/model.ckpt-0_temp_cf2b45f059b74507a65cae9b7a9ea5b4 does exist (on machine1).
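For reference, the device that the failing node is pinned to can be read out of the error message. The following is only a minimal sketch, assuming the `_device="..."` format shown in the log above. The helper name `failing_task` and the `TASK_TO_MACHINE` table are hypothetical names introduced for illustration, the table just restates the placement described earlier, and `error_line` is an abbreviated copy of the node line from the log.

```python
import re

# Matches the device annotation printed in the node description, e.g.
# _device="/job:ps/replica:0/task:1/device:CPU:0" (format assumed from the log above).
DEVICE_RE = re.compile(r'_device="/job:(?P<job>\w+)/replica:\d+/task:(?P<task>\d+)/')


def failing_task(error_text):
    """Return (job_name, task_index) for the first device found, or None."""
    match = DEVICE_RE.search(error_text)
    if match is None:
        return None
    return match.group("job"), int(match.group("task"))


# Placement of tasks on machines in the failing setup described above.
TASK_TO_MACHINE = {
    ("ps", 0): "machine1",
    ("worker", 0): "machine1",
    ("ps", 1): "machine2",
    ("worker", 1): "machine2",
}

# Abbreviated copy of the node line from the worker0 log.
error_line = (
    '[[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT], '
    '_device="/job:ps/replica:0/task:1/device:CPU:0"](save/ShardedFilename_1)]]'
)

task = failing_task(error_line)
print(task, "runs on", TASK_TO_MACHINE[task])
```

If this reading of the log is right, the failing save/SaveV2_1 op is placed on ps1, which runs on machine2, while the directory I checked is on machine1. I have not verified whether that is the actual cause.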
Part of the code (it actually comes from the official tutorial):
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(
    master=server.target,
    config=config,
    is_chief=(FLAGS.task_index == 0),
    checkpoint_dir="./train_dir/dist_{0}_{1}".format(FLAGS.job_name,
                                                     FLAGS.task_index),
    hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Run a training step asynchronously.
    # See `tf.train.SyncReplicasOptimizer` for additional details on how to
    # perform *synchronous* training.
    # mon_sess.run handles AbortedError in case of preempted PS.
    mon_sess.run(train_op)
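I searched Stack Overflow and GitHub for related issues, and the answers to similar questions suggest using HDFS.
If "the chief worker is responsible for saving checkpoints", shouldn't a local directory be needed only on the machine where the chief worker runs? Am I misunderstanding something? Do I really need to use HDFS or something similar?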