tensorflow MonitoredTrainingSession checkpoint hook error: No such file or directory

Time: 2018-06-14 06:19:36

Tags: tensorflow pyspark

Hi, I am running distributed TensorFlow training with MonitoredTrainingSession. I create the session with the following code:

with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(task_index == 0), config=config,
        checkpoint_dir='./checkpoints',
        save_checkpoint_secs=30,
        save_summaries_secs=None,
        save_summaries_steps=None,
        ) as sess, train_model.graph.as_default():

    ...

    (_, step_loss, step_predict_count, step_summary, step,
                 step_word_count, batch_size) = train_model.train(sess)

But I get the following error:

NotFoundError: ./checkpoints/model.ckpt-0_temp_f99363cb89e44afba5df594d2ab72666; No such file or directory
 [[Node: save_1/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](save_1/ShardedFilename_1, save_1/SaveV2_1/tensor_names, save_1/SaveV2_1/shape_and_slices, dynamic_seq2seq/decoder/multi_rnn_cell/cell_0/basic_lstm_cell/bias_G100, dynamic_seq2seq/decoder/multi_rnn_cell/cell_0/basic_lstm_cell/kernel_G102, dynamic_seq2seq/decoder/multi_rnn_cell/cell_1/basic_lstm_cell/bias_G104, dynamic_seq2seq/decoder/multi_rnn_cell/cell_1/basic_lstm_cell/kernel_G106, dynamic_seq2seq/decoder/output_projection/kernel_G108, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias_G110, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel_G112, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias_G114, dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel_G116)]]
 [[Node: save_1/Identity_S37 = _HostRecv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=-8143037005221622295, tensor_name="edge_63_save_1/Identity", tensor_type=DT_STRING, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]

Before this error, there is another message saying:

INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
2018-06-14 06:03:50,686 INFO (MainThread-4689) Saving checkpoints for 0 into ./checkpoints/model.ckpt.

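I have not set up my own checkpoint saver hook, so I assume all checkpoint saving is managed by MonitoredTrainingSession. That is why I am not sure why it is writing to a different location. Any help would be much appreciated!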
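
For reference, the failing SaveV2 node in the trace above is placed on /job:ps/replica:0/task:0, not on the worker. One possible explanation, which is an assumption and not something confirmed in this thread, is that the relative path './checkpoints' is resolved against the working directory of the ps process, where the directory may not exist. A minimal Python 3 sketch of passing an absolute path that exists before the session is created is shown below. `resolve_checkpoint_dir` is a hypothetical helper, not part of TensorFlow, and it only helps if every task can reach the same file system path (same machine or a shared mount).

```python
import os


def resolve_checkpoint_dir(path):
    """Return an absolute version of `path`, creating the directory if needed.

    Hypothetical helper for illustration; not a TensorFlow API.
    """
    abs_path = os.path.abspath(path)
    # exist_ok=True makes this safe when several tasks call it concurrently.
    os.makedirs(abs_path, exist_ok=True)
    return abs_path


checkpoint_dir = resolve_checkpoint_dir('./checkpoints')

# Assumed usage: pass the absolute path instead of the relative one, e.g.
# with tf.train.MonitoredTrainingSession(
#         master=server.target,
#         is_chief=(task_index == 0), config=config,
#         checkpoint_dir=checkpoint_dir,
#         save_checkpoint_secs=30) as sess:
#     ...
```

If the ps and worker tasks run on different machines, as they may under pyspark, a local absolute path would still differ per host, so a location visible to all tasks would be needed instead.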

0 answers:

No answers yet