我正在尝试扩展此处概述的示例https://www.tensorflow.org/deploy/distributed,但我在保存模型时遇到问题。我在gcr.io/tensorflow/tensorflow:1.5.0-gpu-py3
处可用的docker容器中运行它。我为'ps'启动了两个进程,为'worker'启动了一个进程,ps进程就是这个代码:
import tensorflow as tf
def main(_):
cluster = tf.train.ClusterSpec({"ps":["localhost:2222"],"worker":["localhost:2223"]})
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()
if __name__ == "__main__":
tf.app.run()
工作人员代码如下,基于mnist示例和上面的分发文章:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
data_dir = "/data"
checkpoint_dir = "/tmp/train_logs"
def main(_):
cluster = tf.train.ClusterSpec({"ps":["localhost:2222"],"worker":["localhost:2223"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
mnist = input_data.read_data_sets(data_dir, one_hot=True)
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:0", cluster=cluster)):
x = tf.placeholder(tf.float32, [None,784], name="x_input")
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
y = tf.placeholder(tf.float32, [None,10])
model = tf.matmul(x, W) + b
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=model))
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost, global_step=global_step)
prediction = tf.equal(tf.argmax(model,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
hooks = [tf.train.StopAtStepHook(last_step=101)]
with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True, checkpoint_dir=checkpoint_dir, hooks=hooks) as sess:
while not sess.should_stop():
batch_xs, batch_ys = mnist.train.next_batch(1000)
sess.run(train_op, feed_dict={x: batch_xs, y: batch_ys})
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
#saver = tf.train.Saver()
saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True)
with tf.Session() as sess:
saver.restore(sess,latest_checkpoint) # "/tmp/train_logs/model.ckpt"
acc = sess.run(accuracy, feed_dict={x: mnist.test.images,y: mnist.test.labels});
print("Test accuracy = "+"{:5f}".format(acc))
if __name__ == "__main__":
tf.app.run()
我发现的例子似乎都没有显示如何使用模型。以上代码在saver.restore()
行上失败,并出现以下错误:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save/RestoreV2_2':
Operation was explicitly assigned to /job:ps/task:0/device:CPU:0
but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ].
Make sure the device specification refers to a valid device.
另外,如上所示,我尝试saver = tf.train.Saver()
和saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True)
都没有成功。在任何一种情况下都会显示相同的错误。
我真的不理解with tf.device(...):
声明。在一次迭代中,我注释掉了这一行(并且没有缩进它下面的语句),代码运行没有错误。但我认为这是不正确的,并希望了解正确的方法。