After studying the multi-GPU example (multi_gpu_sample), I assumed distributed TensorFlow would work much like multi-GPU training: the CPU corresponds to the ps and the GPUs correspond to the workers.
So the job of the ps would be to collect gradients from the workers, update the parameters, and send the updated parameters back to the workers.
But after reading the distributed TensorFlow example below, I am confused: it seems the ps does nothing but join().
How should I understand this? Thanks!
import tensorflow as tf

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.Variable(0)

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

      saver = tf.train.Saver()
      summary_op = tf.merge_all_summaries()
      init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step,
                             save_model_secs=600)

    # The supervisor takes care of session initialization, restoring from
    # a checkpoint, and closing when done or an error occurs.
    with sv.managed_session(server.target) as sess:
      # Loop until the supervisor shuts down or 1000000 steps have completed.
      step = 0
      while not sv.should_stop() and step < 1000000:
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        _, step = sess.run([train_op, global_step])

    # Ask for all the services to stop.
    sv.stop()


if __name__ == "__main__":
  tf.app.run()
Answer 0 (score: 2)
This is a pretty confusing example (which came from the manual, I believe), and it will probably change as distributed TensorFlow matures.
Anyway, "worker" and "ps" are tasks (or jobs, which are just collections of tasks), so they are not really different things. The difference is in what they are meant to be used for: the idea is that state (i.e. tf.Variable) should live on the parameter servers, while the ops that compute that state should run on the workers. Rather than achieving this by manually calling tf.device everywhere, a helper function called tf.train.replica_device_setter is used, which assigns tf.Variable devices to a parameter server and other ops to the worker:
with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:%d" % FLAGS.task_index,
    cluster=cluster_spec)):
  v1 = tf.Variable(...)  # Automatically assigned to a parameter server.
  train_op = ...         # Automatically assigned to the worker.
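As a rough sketch of what tf.train.replica_device_setter does when there is more than one ps task (the host names ps0/ps1/worker0 below are made up for illustration), variables are spread round-robin across the ps tasks by default, while compute ops stay on the worker device. You can inspect the placement just by building the graph, without starting any servers:

import tensorflow as tf

# Hypothetical 2-ps / 1-worker cluster, for illustration only.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222"]})

with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:0",
    cluster=cluster_spec)):
  w = tf.Variable(tf.zeros([10, 10]))      # expected on /job:ps/task:0
  b = tf.Variable(tf.zeros([10]))          # expected on /job:ps/task:1 (round-robin)
  y = tf.matmul(tf.ones([1, 10]), w) + b   # expected on /job:worker/task:0

# Building the graph is enough to check device assignment.
print(w.device, b.device, y.device)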
server.join() just means that the parameter server will wait for the workers instead of immediately terminating its client process.
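In other words, the ps branch of the example above only needs to keep its in-process server alive (the comments here are my own reading of the code):

if FLAGS.job_name == "ps":
  # The ps process runs no training loop of its own. Its tf.train.Server
  # hosts the variables assigned to /job:ps and serves the workers' read
  # and update requests over gRPC; join() simply blocks so the process
  # stays alive until it is killed externally.
  server.join()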