Question

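I want to run TensorFlow on multiple machines with multiple GPUs. As a first step, I am trying distributed TensorFlow on a single machine (following the TensorFlow tutorial at https://www.tensorflow.org/how_tos/distributed/).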

Below are the lines after which sess.run() hangs:

import tensorflow as tf
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
a = tf.constant(8)
b = tf.constant(9)
sess = tf.Session('grpc://localhost:2222')

Everything works up to this point, but when I call sess.run(), it hangs.

    sess.run(tf.mul(a,b))

If anyone has already worked with distributed TensorFlow, please point me to a solution or to another tutorial that works.

Answer 1

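By default, distributed TensorFlow blocks until all of the servers named in the tf.train.ClusterSpec have started. This happens during the first interaction with the server, which is usually the first sess.run() call. So if you have not yet started a server listening on localhost:2223, TensorFlow will block until you do.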
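To see which task is missing before calling sess.run(), one option is to probe each address in the cluster spec with a plain TCP connection. The following is a minimal sketch that uses only the Python standard library. The helper name unreachable_tasks is hypothetical and not part of TensorFlow, and a successful connection only shows that something is listening on the port, not that it is a TensorFlow server.

```python
import socket


def unreachable_tasks(cluster, timeout=1.0):
    """Return (job, task_index, address) for each task that cannot be reached.

    `cluster` is a plain dict in the same shape that is passed to
    tf.train.ClusterSpec, e.g. {"local": ["localhost:2222", "localhost:2223"]}.
    This helper is an illustration only; it is not a TensorFlow API.
    """
    missing = []
    for job, addresses in cluster.items():
        for task_index, address in enumerate(addresses):
            host, port = address.rsplit(":", 1)
            try:
                # Opening and immediately closing a connection is enough to
                # learn whether anything is listening on that port.
                with socket.create_connection((host, int(port)), timeout=timeout):
                    pass
            except OSError:
                missing.append((job, task_index, address))
    return missing
```

With the cluster from the question, and only the task 0 server running, calling unreachable_tasks({"local": ["localhost:2222", "localhost:2223"]}) should report the task at localhost:2223.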

Depending on your later goals, there are a few ways to solve this:

Start a server on localhost:2223. In another process, run the following script:

 import tensorflow as tf
 cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
 server = tf.train.Server(cluster, job_name="local", task_index=1)
 server.join()  # Wait forever for incoming connections.

Remove task 1 from the original tf.train.ClusterSpec:

 import tensorflow as tf
 cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
 server = tf.train.Server(cluster, job_name="local", task_index=0)
 # ...

Specify "device filters" when creating the tf.Session, so that the session only uses task 0:

 # ...
 sess = tf.Session("grpc://localhost:2222",
                   config=tf.ConfigProto(device_filters=["/job:local/task:0"]))
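The filter strings follow the /job:<name>/task:<index> pattern shown above. If a session should be limited to a particular set of tasks, the list can be built programmatically. This is a small sketch in plain Python; device_filters_for is a hypothetical helper name, not a TensorFlow function.

```python
def device_filters_for(job_name, task_indices):
    """Build device filter strings such as "/job:local/task:0".

    Hypothetical helper for illustration; it only formats strings.
    """
    return ["/job:%s/task:%d" % (job_name, index) for index in task_indices]


# Restrict the session to task 0 of the "local" job, as in the answer above.
filters = device_filters_for("local", [0])
# The list would then be passed as tf.ConfigProto(device_filters=filters).
```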

Distributed TensorFlow hangs at sess.run()