Can you explain the distributed TensorFlow tutorial example?

Time: 2017-09-05 18:51:13

Tags: tensorflow distributed-computing distributed-system tensorflow-serving

I'm fairly new to the world of distributed computing. I'm reading the following official TensorFlow tutorial, but I'm quite confused by what happens in its main example.

In particular, how do the ps jobs and the workers interact? What exactly is the role of the ps jobs? Their corresponding part of the code is quite limited and they don't seem to do much, so what is their purpose? I guess I don't understand how the pieces of this distributed system work together.

It would be great if someone could explain what happens when the shell commands at the end are executed, in terms of the different processes and what each of them does.

Here is the main code for reference:

import argparse
import sys

import tensorflow as tf

FLAGS = None

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.contrib.framework.get_or_create_global_step()

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

    # The StopAtStepHook handles stopping after running given steps.
    hooks=[tf.train.StopAtStepHook(last_step=1000000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           checkpoint_dir="/tmp/train_logs",
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        # mon_sess.run handles AbortedError in case of preempted PS.
        mon_sess.run(train_op)

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Here are the shell commands:

  # On ps0.example.com:
  $ python trainer.py \
      --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
      --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
      --job_name=ps --task_index=0

  # On ps1.example.com:
  $ python trainer.py \
      --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
      --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
      --job_name=ps --task_index=1

  # On worker0.example.com:
  $ python trainer.py \
      --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
      --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
      --job_name=worker --task_index=0

  # On worker1.example.com:
  $ python trainer.py \
      --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
      --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
      --job_name=worker --task_index=1

2 Answers:

Answer 0 (score: 2)

Here is a schematic of the situation: you have four TensorFlow processes. Each process runs a TensorFlow worker thread, which can execute TensorFlow computations. In addition, two of the processes also run a client thread, which issues session.run requests.
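
To make the worker-thread/client-thread distinction concrete, here is a minimal sketch under assumed conditions: a hypothetical single-machine cluster on localhost ports stands in for the four machines, using the TF 1.x API as in the question. It is an illustration, not the tutorial's own code.

import tensorflow as tf

# Hypothetical single-machine cluster standing in for the 4 machines above.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Every one of the 4 processes creates a server like this (each with its own
# job_name/task_index); the server's worker thread is what executes ops.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A ps process stops here: server.join() blocks forever, so it only serves
# requests made by others -- it has a worker thread but no client thread.
#
# A worker process goes on to build a graph and open a session; that session
# is the "client thread" that issues session.run requests.
# (For the session to come up, the other tasks in the ClusterSpec must also
# have been started in their own processes.)
with tf.Session(server.target) as sess:
  print(sess.run(tf.constant("hello from a client thread")))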

Each worker process is also a "device" in TensorFlow, in the sense used for splitting graph execution across devices. You can tell the TF runtime to execute certain parts of the graph on the worker1 device by writing something like with tf.device("/job:worker/task:0"): during graph construction.
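
For illustration, here is a small sketch (assuming a cluster and server already created as in the question's trainer.py; the constants are made up) of manually pinning pieces of a graph to specific worker tasks:

import tensorflow as tf

with tf.device("/job:worker/task:0"):
  a = tf.constant(3.0)
  b = tf.constant(4.0)
  total = a + b            # this add op is pinned to worker task 0

with tf.device("/job:worker/task:1"):
  doubled = 2.0 * total    # this op runs on worker task 1; TF ships `total`
                           # between the two tasks over the gRPC transport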

The magic done by tf.train.replica_device_setter is that it replaces the manual with tf.device annotations and has the effect of spreading variables across devices automatically. More specifically, with two PS shards, half of your variables go onto the ps1 device and the other half onto the ps2 device. Meanwhile, the parts of the graph that update those variables are replicated on each worker device.
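
A minimal sketch of that placement behaviour, assuming the same 2-ps / 2-worker ClusterSpec as in the question and the TF 1.x API; the variable names and shapes are made up for illustration:

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:0", cluster=cluster)):
  x = tf.placeholder(tf.float32, [None, 784])
  w = tf.Variable(tf.zeros([784, 10]), name="w")  # placed on one ps task
  b = tf.Variable(tf.zeros([10]), name="b")       # placed on the other (round-robin)
  logits = tf.matmul(x, w) + b                    # compute ops stay on the worker device

print(w.device)  # e.g. "/job:ps/task:0"
print(b.device)  # e.g. "/job:ps/task:1"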

If you replaced the replica_device_setter with manual device specifications, your worker process would look roughly like this:

# 'ps1', 'ps2' and 'worker1' are shorthand here for full device strings such
# as "/job:ps/task:0", "/job:ps/task:1" and "/job:worker/task:0".
with tf.device('ps1'):
  var1 = tf.Variable(...)
with tf.device('ps2'):
  var2 = tf.Variable(...)
with tf.device('worker1'):
  update_op1 = var1.assign_add(grad1)
  update_op2 = var2.assign_add(grad2)

while True:
  sess.run([update_op1, update_op2])

Communication is handled automatically. When you execute sess.run(update_op1) in the worker1 client thread, it computes grad1 on worker1, then sends the result to the ps1 task, which triggers the ps1 worker thread to update its copy of var1.
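
Here is a small sketch of that flow with explicit device annotations (hypothetical variable and loss, TF 1.x API), showing where the gradient and the update live:

import tensorflow as tf

with tf.device("/job:ps/task:0"):
  var1 = tf.Variable(1.0)

with tf.device("/job:worker/task:0"):
  loss = tf.square(var1 - 5.0)
  (grad1,) = tf.gradients(loss, [var1])      # grad1 is computed on the worker

with tf.device("/job:ps/task:0"):
  update_op1 = var1.assign_sub(0.1 * grad1)  # the update itself runs on the ps task

# Running sess.run(update_op1) from a worker's client thread therefore:
# computes grad1 on the worker, ships it to the ps task, and applies the
# update to var1 there.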

Answer 1 (score: 1)

As I understand it, the ps job holds all the data that is shared between the different tasks, which can run on different machines (and which all share the same ps job).
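
A minimal sketch of that idea (hypothetical localhost cluster, TF 1.x API): a variable placed on the ps job is a single piece of state that every worker reads and updates.

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223", "localhost:2224"]})

with tf.device("/job:ps/task:0"):
  shared_step = tf.Variable(0, name="shared_step")

bump = tf.assign_add(shared_step, 1)

# Build this same graph in a client on worker 0 and in a client on worker 1:
# both sessions read and increment the one copy of `shared_step` that lives
# on the ps task, so its value keeps growing across workers.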