Distributing graphs across multiple machines with Distributed TensorFlow

Date: 2016-07-13 16:02:56

Tags: tensorflow

I am currently working on a project that uses Distributed TensorFlow. My goal is to run several independent graphs on several different machines.

As an example, I would like to do something like this (assuming a server is running on each machine):

import tensorflow as tf
a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a,b)             # To be run on "grpc://www.example0.com:2222"
y = tf.mul(a,b)             # To be run on "grpc://www.example1.com:2222"
z = tf.mul(a,b)             # To be run on "grpc://www.example2.com:2222"

with tf.Session() as sess:
    sess.run([x,y,z])       # Ops x,y,z are run on different machines in parallel
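
To be concrete about "a server is running": below is a minimal sketch of what I start on each machine. create_local_server picks its own port, so a real multi-machine setup would instead construct tf.train.Server from a shared tf.train.ClusterSpec with fixed ports.

import tensorflow as tf

# Minimal in-process server; it listens on localhost on a port of its
# own choosing and serves graph execution requests until the process dies.
server = tf.train.Server.create_local_server()
print(server.target)  # e.g. "grpc://localhost:<port>"
server.join()         # block forever, serving requests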

My current attempt is shown in the code below. However, this code runs the sessions serially, whereas I want them to execute in a parallel, distributed fashion.

import tensorflow as tf
a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a,b)             # To be run on "grpc://www.example0.com:2222"
y = tf.mul(a,b)             # To be run on "grpc://www.example1.com:2222"
z = tf.mul(a,b)             # To be run on "grpc://www.example2.com:2222"

with tf.Session("grpc://www.example0.com:2222") as sess:
    sess.run(x)
with tf.Session("grpc://www.example1.com:2222") as sess:
    sess.run(y)
with tf.Session("grpc://www.example2.com:2222") as sess:
    sess.run(z)
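
The closest I have come to running these in parallel is to put each serial session in its own Python thread (a sketch, assuming the three servers above are reachable), but this still feels like a workaround rather than proper distribution:

import threading

import tensorflow as tf

a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a, b)

def run_on(target):
    # Each thread opens its own session against one remote master
    # and runs the (identical) op there.
    with tf.Session(target) as sess:
        print(target, sess.run(x))

targets = ["grpc://www.example0.com:2222",
           "grpc://www.example1.com:2222",
           "grpc://www.example2.com:2222"]

threads = [threading.Thread(target=run_on, args=(t,)) for t in targets]
for t in threads:
    t.start()
for t in threads:
    t.join()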

While reading the documentation on Distributed TensorFlow, I found that tf.device lets me choose the CPU or GPU on which a TensorFlow op runs. Is there something similar that lets me set the session target, to specify which machine will run which op? Or is there another way to distribute TensorFlow ops across machines?

1 Answer:

Answer 0 (score: 1)

I'm working through this problem myself right now. The following is mostly taken from the tensorflow distributed how-to guide.

You can pin ops to jobs/tasks using tf.device:
import tensorflow as tf

clusterspec = \
    { "worker": 
        [ "www.example0.com:2222"
        , "www.example1.com:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "localhost:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)

# pin 'x' to www.example0.com
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)

# pin 'y' to www.example1.com
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    # run the ops
    print(sess.run([x, y]))

However, at least for me, this only works when all of the worker processes are on the same machine as the master. Otherwise it hangs at sess.run.

I ran into problems using localhost in the cluster spec. If you share the same cluster spec between servers, don't use localhost; instead, use the IP address or hostname of the machine that you think localhost refers to. For the example above, suppose you run the master script on www.master.com. You then have two options:

1. One clusterspec per server, using localhost

On each server, localhost refers to the machine that server is running on.

# on www.example0.com
import tensorflow as tf

clusterspec = \
    { "worker":
        [ "localhost:2222"
        , "www.example1.com:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "www.master.com:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()

# on www.example1.com
import tensorflow as tf

clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "localhost:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "www.master.com:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=1)
server.join()

# on www.example2.com
import tensorflow as tf

clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "www.example1.com:2222"
        , "localhost:2222"
        ]
    , "master":
        [ "www.master.com:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=2)
server.join()

# on www.master.com
import tensorflow as tf

clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "www.example1.com:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "localhost:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)

with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)

with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    print(sess.run([x, y]))
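
Note that the worker scripts need to be started first (and left blocked in server.join()) before you run the master script; if a worker in the clusterspec is unreachable, the master's sess.run will block waiting for it.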

2. One shared clusterspec

A single cluster spec, using IP addresses/hostnames that are visible from every node.

Saved as clusterspec.json:

{ "worker":
  [ "www.example0.com:2222"
  , "www.example1.com:2222"
  , "www.example2.com:2222"
  ]
, "master":
  [ "www.master.com:2222" ]
}

Then, on each worker:

import json
import tensorflow as tf

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=<INDEX OF TASK>)
server.join()  # block so the worker stays up to serve the master
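
To avoid editing the task index on every machine, one option (my own convention, not something from the guide; the --task_index flag name is hypothetical) is to read it from the command line:

import argparse
import json

import tensorflow as tf

# Hypothetical flag; any mechanism that yields this worker's index
# into the "worker" list of clusterspec.json works equally well.
parser = argparse.ArgumentParser()
parser.add_argument("--task_index", type=int, required=True)
args = parser.parse_args()

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=args.task_index)
server.join()  # block, serving requests from the master

You would then launch it as python worker.py --task_index=0 on www.example0.com, --task_index=1 on www.example1.com, and so on.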

Then, on the master:

import json
import tensorflow as tf

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)

with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)

with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    print(sess.run([x, y]))
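
With a = 3 and b = 2, both x and y evaluate to 6, so the final line should print [6, 6], with x computed on worker task 0 (www.example0.com) and y on worker task 1 (www.example1.com).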