How does the ps work in distributed TensorFlow?

Date: 2017-01-05 08:17:16

Tags: tensorflow deep-learning

After studying the multi-GPU example (multi_gpu_sample), I assumed that distributed TensorFlow works much like the multi-GPU case: the CPU corresponds to the ps and the GPUs correspond to the workers.

So I thought the ps's job is to collect the gradients from the workers, update the parameters, and then send the updated parameters back to the workers.

But after reading the distributed TensorFlow example below, I am confused. It seems that the ps does nothing but call server.join().

How should I understand this? Thanks!

import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.merge_all_summaries()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization, restoring from
    # a checkpoint, and closing when done or an error occurs.
    with sv.managed_session(server.target) as sess:
      # Loop until the supervisor shuts down or 1000000 steps have completed.
      step = 0
      while not sv.should_stop() and step < 1000000:
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        _, step = sess.run([train_op, global_step])

    # Ask for all the services to stop.
    sv.stop()

if __name__ == "__main__":
  tf.app.run()

1 Answer:

Answer 0 (score: 2):

This is a very confusing example (which came from the manual, I believe). It will probably change as distributed TensorFlow matures.

In any case, "worker" and "ps" are both jobs (or rather collections of tasks), so they are not really different from each other. The difference is what they are meant to be used for. The idea is that state (e.g., tf.Variable) should live on the parameter servers, while the ops that compute that state should run on the workers. Instead of arranging this manually by calling tf.device everywhere, a helper function called tf.train.replica_device_setter assigns variables to the parameter servers and the other ops to the worker.

  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:%d" % FLAGS.task_index,
      cluster=cluster_spec)):
    v1 = tf.Variable(...)  # Automatically assigned to a parameter server.
    train_op = ...         # Automatically assigned to the worker.

server.join() simply means that the parameter server will wait for the workers rather than immediately terminating its process.
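To make the placement concrete, here is a minimal sketch (not from the original answer; the cluster addresses are illustrative) that builds a small graph under replica_device_setter and prints where each node is placed. No server needs to be started, since device assignment happens at graph-construction time:

  import tensorflow as tf

  # Illustrative cluster: one ps task and one worker task.
  cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                  "worker": ["localhost:2223"]})

  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:0", cluster=cluster)):
    v = tf.Variable(tf.zeros([10]), name="v")  # state -> parameter server
    y = v * 2.0                                # computation -> worker

  print(v.device)  # /job:ps/task:0
  print(y.device)  # /job:worker/task:0

So when the ps process sits in server.join(), it is still serving reads and updates of variables like v on behalf of the workers; only the training loop itself runs in the worker processes.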
