Saving and restoring a model with MonitoredTrainingSession

Time: 2018-03-08 16:05:07

Tags: python tensorflow distributed

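I am trying to extend the example outlined at https://www.tensorflow.org/deploy/distributed, but I am having trouble saving the model. I am running it in the Docker container available at gcr.io/tensorflow/tensorflow:1.5.0-gpu-py3. I start two processes, one for 'ps' and one for 'worker'. The ps process is just this code: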

import tensorflow as tf
def main(_):
   cluster = tf.train.ClusterSpec({"ps":["localhost:2222"],"worker":["localhost:2223"]})
   server = tf.train.Server(cluster, job_name="ps", task_index=0)
   server.join()
if __name__ == "__main__":
   tf.app.run()

The worker code is below, based on the MNIST example and the distributed article above:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

data_dir = "/data"
checkpoint_dir = "/tmp/train_logs"

def main(_):
   cluster = tf.train.ClusterSpec({"ps":["localhost:2222"],"worker":["localhost:2223"]})
   server = tf.train.Server(cluster, job_name="worker", task_index=0)
   mnist = input_data.read_data_sets(data_dir, one_hot=True)

   with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:0", cluster=cluster)):
     x = tf.placeholder(tf.float32, [None,784], name="x_input")
     W = tf.Variable(tf.zeros([784,10]))
     b = tf.Variable(tf.zeros([10]))
     y = tf.placeholder(tf.float32, [None,10])
     model = tf.matmul(x, W) + b
     cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=model))
     global_step = tf.train.get_or_create_global_step()
     train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost, global_step=global_step)
     prediction = tf.equal(tf.argmax(model,1), tf.argmax(y,1))
     accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))

   # Train until global_step reaches 101; checkpoints are written to checkpoint_dir.
   hooks = [tf.train.StopAtStepHook(last_step=101)]
   with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True, checkpoint_dir=checkpoint_dir, hooks=hooks) as sess:
     while not sess.should_stop():
        batch_xs, batch_ys = mnist.train.next_batch(1000)
        sess.run(train_op, feed_dict={x: batch_xs, y: batch_ys})

   # Reload the latest checkpoint in a plain local session and evaluate on the test set.
   latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
   #saver = tf.train.Saver()
   saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True)
   with tf.Session() as sess:
     saver.restore(sess, latest_checkpoint) # "/tmp/train_logs/model.ckpt"
     acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
     print("Test accuracy = "+"{:5f}".format(acc))

if __name__ == "__main__":
   tf.app.run()

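None of the examples I have found show how to actually use the model afterwards. The code above fails on the saver.restore() line with the following error: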

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save/RestoreV2_2': 
Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 
but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ].
Make sure the device specification refers to a valid device.

Also, as shown above, I tried both saver = tf.train.Saver() and saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True), without success. The same error appears in either case.
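As a reading aid for the error message: a plain tf.Session() only owns the /job:localhost devices listed in the error, so an op pinned to /job:ps/task:0 has nowhere to run. The sketch below is a toy model in plain Python, not TensorFlow's actual placer. The helpers parse_device and can_place are hypothetical names introduced only for illustration.

```python
# Toy model of device placement. This is NOT TensorFlow's placer;
# parse_device and can_place are hypothetical helpers for illustration.

def parse_device(spec):
    """Split '/job:ps/task:0/device:CPU:0' into a dict of its fields."""
    fields = {}
    for part in spec.strip("/").split("/"):
        if part:
            key, _, value = part.partition(":")
            fields[key] = value
    return fields


def can_place(requested, available):
    """True if some available device agrees with every requested field."""
    wanted = parse_device(requested)
    for device in available:
        have = parse_device(device)
        if all(have.get(key) == value for key, value in wanted.items()):
            return True
    return False


# The devices the error message reports for the local session.
LOCAL_DEVICES = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
]

pinned = "/job:ps/task:0/device:CPU:0"    # device named in the error
print(can_place(pinned, LOCAL_DEVICES))   # op still pinned to the ps job
print(can_place("", LOCAL_DEVICES))       # op with its device field cleared
```

In this model the pinned op cannot be placed, while an op with an empty device field can. Note that clear_devices=True only clears the device fields of the nodes that import_meta_graph brings in; nodes already in the default graph keep their pins. Whether that fully accounts for the failure in this setup is an assumption, not something verified here.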

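I don't really understand the with tf.device(...): statement. In one iteration I commented out that line (and un-indented the statements below it), and the code ran without error. But I assume that is not the right approach, and I would like to understand the correct way to do this.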
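For context on the with tf.device(...) block: tf.train.replica_device_setter returns a device function that puts variable ops on the ps tasks in round-robin order and everything else on worker_device. The following is a simplified plain-Python model of that rule, not the real implementation. The name make_device_chooser and the op-type labels are hypothetical.

```python
# Simplified model of the placement rule used by tf.train.replica_device_setter.
# NOT the real implementation; make_device_chooser is a hypothetical helper
# and the op-type strings are illustrative labels.
import itertools

VARIABLE_OP_TYPES = {"Variable", "VariableV2"}


def make_device_chooser(ps_tasks, worker_device):
    """Return a function that maps an op type to a device string."""
    ps_cycle = itertools.cycle(range(ps_tasks))

    def choose(op_type):
        if op_type in VARIABLE_OP_TYPES:
            # Variables are spread over the ps tasks, one after another.
            return "/job:ps/task:%d" % next(ps_cycle)
        # All other ops stay on the worker.
        return worker_device

    return choose


# One ps task and one worker, as in the ClusterSpec from the question.
choose = make_device_chooser(ps_tasks=1, worker_device="/job:worker/task:0")
ops = [("x", "Placeholder"), ("W", "VariableV2"), ("b", "VariableV2"), ("model", "Add")]
placement = {name: choose(op_type) for name, op_type in ops}
for name, device in placement.items():
    print(name, "->", device)
```

Under this model W and b land on /job:ps/task:0, which matches the device named in the error message, while the placeholders and the math ops stay on the worker.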

0 answers:

No answers yet.