I am currently working on a project with Distributed TensorFlow. My goal is to run several independent graphs on several different machines.
As an example, I want to do something like this (assume that a server is running on each machine):
import tensorflow as tf
a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a,b) # To be run on "grpc://www.example0.com:2222"
y = tf.mul(a,b) # To be run on "grpc://www.example1.com:2222"
z = tf.mul(a,b) # To be run on "grpc://www.example2.com:2222"
with tf.Session() as sess:
    sess.run([x, y, z]) # ops x, y, z are run on different machines in parallel
My current attempt is shown in the following code. However, this code runs the sessions serially, whereas I want them to execute in a parallel, distributed fashion:
import tensorflow as tf
a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a,b) # To be run on "grpc://www.example0.com:2222"
y = tf.mul(a,b) # To be run on "grpc://www.example1.com:2222"
z = tf.mul(a,b) # To be run on "grpc://www.example2.com:2222"
with tf.Session("grpc://www.example0.com:2222") as sess:
sess.run(x)
with tf.Session("grpc://www.example1.com:2222") as sess:
sess.run(y)
with tf.Session("grpc://www.example2.com:2222") as sess:
sess.run(z)
While reading the documentation on Distributed TensorFlow, I found that tf.device lets me set the CPU or GPU on which a TensorFlow op runs. Is there something similar that lets me set the session target, to specify which machine will run which op? Or is there another way to distribute TensorFlow ops?
Answer 0 (score: 1)
I am working through this exact problem right now. What follows is mostly drawn from the tensorflow distributed how-to guide.
You can use tf.device to pin each op to a task in the cluster:
import tensorflow as tf

clusterspec = {
    "worker": [
        "www.example0.com:2222",
        "www.example1.com:2222",
        "www.example2.com:2222",
    ],
    "master": ["localhost:2222"],
}
cluster = tf.train.ClusterSpec(clusterspec)
a = tf.constant(3)
b = tf.constant(2)
# pin 'x' to www.example0.com
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)

# pin 'y' to www.example1.com
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    # run the ops
    print(sess.run([x, y]))
However, at least for me, this only works when all of the worker processes are on the same machine as the master. Otherwise, it hangs at sess.run.
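As a sanity check, the same-machine case can be reproduced in a single process. Here is a minimal sketch (my own test setup, not from the guide; it assumes ports 2222-2224 are free on localhost):

import tensorflow as tf

# All tasks live in one process; tf.train.Server starts serving
# in background threads as soon as it is constructed.
clusterspec = {
    "worker": ["localhost:2222", "localhost:2223"],
    "master": ["localhost:2224"],
}
cluster = tf.train.ClusterSpec(clusterspec)
workers = [tf.train.Server(cluster, job_name="worker", task_index=i)
           for i in range(2)]
master = tf.train.Server(cluster, job_name="master", task_index=0)

a = tf.constant(3)
b = tf.constant(2)
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

with tf.Session(master.target) as sess:
    print(sess.run([x, y]))  # prints [6, 6], one multiply per worker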
I ran into problems using localhost in the cluster spec. If you share the same cluster spec between servers, don't use localhost; instead, use the IP address or hostname of the machine that you think localhost refers to. For the example above, suppose you are running the master script on www.master.com. You have two options:
Option 1: a separate cluster spec on each server, in which localhost refers to the machine running that server.
# on www.example0.com
import tensorflow as tf

clusterspec = {
    "worker": [
        "localhost:2222",
        "www.example1.com:2222",
        "www.example2.com:2222",
    ],
    "master": ["www.master.com:2222"],
}
cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()
# on www.example1.com
import tensorflow as tf

clusterspec = {
    "worker": [
        "www.example0.com:2222",
        "localhost:2222",
        "www.example2.com:2222",
    ],
    "master": ["www.master.com:2222"],
}
cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=1)
server.join()
# on www.example2.com
import tensorflow as tf

clusterspec = {
    "worker": [
        "www.example0.com:2222",
        "www.example1.com:2222",
        "localhost:2222",
    ],
    "master": ["www.master.com:2222"],
}
cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=2)
server.join()
# on www.master.com
import tensorflow as tf

clusterspec = {
    "worker": [
        "www.example0.com:2222",
        "www.example1.com:2222",
        "www.example2.com:2222",
    ],
    "master": ["localhost:2222"],
}
cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    print(sess.run([x, y]))
Option 2: a single cluster spec, using IP addresses or domain names that are visible from every node. Save it as clusterspec.json:
{ "worker":
[ "www.example0.com:2222"
, "www.example1.com:2222"
, "www.example2.com:2222"
]
, "master":
[ "www.master.com:2222" ]
}
然后对每个工人:
import json
import tensorflow as tf

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=<INDEX OF TASK>)
server.join()
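Rather than editing the task index into the file on every machine, it can be read from the command line, so all workers share one script. A sketch of this (the worker.py name and --task_index flag are my own, hypothetical choices):

import argparse
import json
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--task_index", type=int, required=True,
                    help="this machine's position in the worker list")
args = parser.parse_args()

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker",
                         task_index=args.task_index)
server.join()  # serve until the process is killed

Each machine then runs the same file, e.g. python worker.py --task_index=0 on www.example0.com.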
Then, on the master:
import json
import tensorflow as tf

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    print(sess.run([x, y]))
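With the shared spec, start the worker scripts first, one per machine, then run the master script. Each multiply should execute on the worker it was pinned to, and sess.run([x, y]) should print [6, 6].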