I'm fairly new to the world of distributed computing. I'm working through the official TensorFlow tutorial on distributed training, but I'm quite confused about what actually happens in the tutorial's main example.
Specifically, how do the ps job and the workers interact? What exactly is the role of the ps job? Its part of the code is very limited and it doesn't seem to do much, so what is its purpose? I guess I don't understand how the pieces of this distributed system work together.
It would be great if someone could explain what happens when the shell commands at the end are executed, in terms of the different processes and what each of them does.
Here is the main code for reference:
import argparse
import sys
import tensorflow as tf
FLAGS = None
def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.contrib.framework.get_or_create_global_step()

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

    # The StopAtStepHook handles stopping after running given steps.
    hooks = [tf.train.StopAtStepHook(last_step=1000000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           checkpoint_dir="/tmp/train_logs",
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        # mon_sess.run handles AbortedError in case of preempted PS.
        mon_sess.run(train_op)


if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
And these are the shell commands:
# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1
Answer 0 (score: 2)
To picture the situation: you have four TensorFlow processes. Each process runs a TensorFlow worker thread that can execute TensorFlow computations. In addition, two of the processes also run a client thread, which is the thread that issues session.run requests.
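To make the worker-thread vs. client-thread distinction concrete, here is a tiny sketch that is not from the tutorial (it uses an in-process local server purely for illustration): the Session is the client, and the server it points at hosts the worker service that actually executes the ops.

import tensorflow as tf

# A local in-process server stands in for one worker task.
server = tf.train.Server.create_local_server()

# This thread is the "client": it builds a graph and issues session.run
# requests; the server's worker service is what runs the ops.
with tf.Session(server.target) as sess:
    print(sess.run(tf.constant(42)))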
Each worker process is also a "device" in TensorFlow, which is how graph execution gets split across machines. You can tell the TF runtime to execute part of the graph on the worker1 device by wrapping it in something like with tf.device("/job:worker/task:0"): during graph construction.
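For example, a rough sketch of manual placement (the variable and op names below are assumptions for illustration, not part of the tutorial):

import tensorflow as tf

# Pin a variable to the first ps task and an op that reads it to the first
# worker task; at run time, TF inserts the send/recv transfers between them.
with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]), name="w")

with tf.device("/job:worker/task:0"):
    doubled = tf.multiply(w, 2.0)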
The magic that happens in tf.train.replica_device_setter is that it replaces the manual with tf.device annotations and automatically assigns variables across devices. More concretely, when you have two PS shards, half of the variables go onto the ps1 device and the other half onto the ps2 device. Meanwhile, the part of the graph that updates those variables is replicated onto each worker device.
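A rough sketch of that placement behavior, assuming a made-up cluster with two ps tasks and one worker (the hostnames and variable names are illustrative):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    a = tf.Variable(tf.zeros([100]))   # expected to land on /job:ps/task:0
    b = tf.Variable(tf.zeros([100]))   # expected to land on /job:ps/task:1
    c = a + b                          # non-variable ops stay on the worker device

print(a.device, b.device, c.device)

Printing the .device attribute of each node is a quick way to confirm where the setter decided to put things.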
If you replaced replica_device_setter with manual device specifications, your worker code would look roughly like this:
# Illustrative pseudocode: 'ps1', 'ps2', and 'worker1' stand in for full
# device strings such as '/job:ps/task:0' or '/job:worker/task:0'.
with tf.device('ps1'):
    var1 = tf.Variable(...)   # stored on the first PS shard
with tf.device('ps2'):
    var2 = tf.Variable(...)   # stored on the second PS shard

with tf.device('worker1'):
    # the update ops run on the worker, writing to the PS-hosted variables
    update_op1 = var1.assign_add(grad1)
    update_op2 = var2.assign_add(grad2)

while True:
    sess.run([update_op1, update_op2])
The communication is handled automatically. When you execute sess.run(update_op1) in the worker1 client thread, it computes grad1 on worker1, sends the result to the ps1 task, and triggers the ps1 worker thread to update its value of var1.
Answer 1 (score: 1)
From my understanding, the ps job holds all the data that is shared between the different tasks; the tasks can run on different machines and all share the same ps job.
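A toy sketch of that sharing (the hostnames and the counter variable are made up, and it assumes the cluster's tf.train.Server processes are already running): because the variable is placed on the ps task, every worker that builds this same graph reads and updates the single copy held by the ps process rather than a private local copy.

import tensorflow as tf

# Each worker process would build this same graph; the variable itself is
# stored by the ps task, so there is exactly one copy of it in the cluster.
with tf.device("/job:ps/task:0"):
    counter = tf.Variable(0, name="shared_counter")

increment = counter.assign_add(1)  # each run bumps the single ps-held copy

# Connect a client to one of the worker servers (address is an assumption).
with tf.Session("grpc://worker0.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(increment))  # any worker running this sees the same counter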