Hello, I am using MonitoredTrainingSession to run distributed TensorFlow training. I created the session with the following code:
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            config=config,
            checkpoint_dir='./checkpoints',
            save_checkpoint_secs=30,
            save_summaries_secs=None,
            save_summaries_steps=None,
    ) as sess, train_model.graph.as_default():
        ...
        (_, step_loss, step_predict_count, step_summary, step,
         step_word_count, batch_size) = train_model.train(sess)
However, I get the following error:
NotFoundError: ./checkpoints/model.ckpt-0_temp_f99363cb89e44afba5df594d2ab72666; No such file or directory
[[Node: save_1/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](save_1/ShardedFilename_1, save_1/SaveV2_1/tensor_names, save_1/SaveV2_1/shape_and_slices, dynamic_seq2seq/decoder/multi_rnn_cell/cell_0/basic_lstm_cell/bias_G100, dynamic_seq2seq/decoder/multi_rnn_cell/cell_0/basic_lstm_cell/kernel_G102, dynamic_seq2seq/decoder/multi_rnn_cell/cell_1/basic_lstm_cell/bias_G104, dynamic_seq2seq/decoder/multi_rnn_cell/cell_1/basic_lstm_cell/kernel_G106, dynamic_seq2seq/decoder/output_projection/kernel_G108, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias_G110, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel_G112, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias_G114, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel_G116)]]
[[Node: save_1/Identity_S37 = _HostRecv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=-8143037005221622295, tensor_name="edge_63_save_1/Identity", tensor_type=DT_STRING, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
Before this error, there is another message that says:
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
2018-06-14 06:03:50,686 INFO (MainThread-4689) Saving checkpoints for 0 into ./checkpoints/model.ckpt.
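I have not set up a checkpoint saver hook of my own, so I assume all checkpoint saving should be managed by MonitoredTrainingSession. I am therefore not sure why it is writing to a different location (the log reports ./checkpoints/model.ckpt, while the error refers to ./checkpoints/model.ckpt-0_temp_...). Any help would be greatly appreciated!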
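
One thing that may be worth ruling out: the failing SaveV2 node in the trace is placed on /job:ps, so the relative path './checkpoints' may be getting resolved against the working directory of the ps process rather than the worker's. The sketch below is only a diagnostic under that assumption, not a confirmed fix. The helper name resolve_checkpoint_dir and its base_dir parameter are hypothetical and not part of TensorFlow. It turns the relative path into an absolute one and creates the directory up front, so every process refers to the same location (which still has to be on storage that all tasks can reach).

```python
import os
import tempfile


def resolve_checkpoint_dir(path, base_dir=None):
    """Return an absolute, already-created checkpoint directory.

    Hypothetical helper (not a TensorFlow API). `base_dir` is the directory
    that a relative `path` is resolved against; it defaults to the current
    working directory of the calling process.
    """
    base = base_dir if base_dir is not None else os.getcwd()
    abs_path = os.path.abspath(os.path.join(base, path))
    # Create the directory up front so a later save does not depend on
    # whichever process happens to touch it first.
    os.makedirs(abs_path, exist_ok=True)
    return abs_path


if __name__ == "__main__":
    # Demo against a throwaway base directory instead of the real one.
    demo_base = tempfile.mkdtemp()
    ckpt_dir = resolve_checkpoint_dir("./checkpoints", base_dir=demo_base)
    print("checkpoint_dir resolved to:", ckpt_dir)
```

The returned value would then be passed as checkpoint_dir in place of './checkpoints'. If the ps and worker tasks run on different machines, an absolute path alone would not be enough under this assumption; the directory would also need to live on a filesystem that all of them share.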