I implemented between-graph replication and asynchronous training following the example at https://www.tensorflow.org/deploy/distributed. Then I started two parameter servers and one worker as follows:
python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=ps --task_index=0
python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=ps --task_index=1
python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=worker --task_index=0
I have three questions about distributed TensorFlow.
First, according to the TensorFlow timeline of my program (shown below), all compute and variable-update ops are executed on the ps node while the worker node stays idle. This confuses me, because I thought the compute ops should run on the worker node rather than on the ps node. Could someone help me with this?
(screenshot: distributed TensorFlow timeline)
Second, in my program tf.train.replica_device_setter assigns only CPUs to the parameter servers, yet the ops end up running on both CPU and GPU. What is the correct way to assign CPUs/GPUs to the servers?
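For reference, my understanding from the linked guide is that the device setter should point at the cluster jobs rather than at specific local devices, roughly like this (just a sketch of what I expected, not my actual code, which is further below):

    # Sketch: pin variables to the ps job's CPUs and compute ops to this worker's GPU.
    with tf.device(tf.train.replica_device_setter(
            ps_device="/job:ps/cpu:0",
            worker_device="/job:worker/task:%d/gpu:0" % FLAGS.task_index,
            cluster=cluster)):
        loss = ...  # model definition as in my code
        train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)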
Last but not least, if I start two parameter servers and three workers, will the two servers hold identical copies of the parameters? I would also like to know whether the three workers update the gradients of the same graph. Could anyone tell me?
P.S. I assigned devices with tf.train.replica_device_setter. However, in the example (https://www.tensorflow.org/deploy/distributed) no device is assigned to the local server. In my case, if I do not assign a device to the local server, I get an error like the following:
"Operation was explicitly assigned to /job:ps/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ... ]. Make sure the device specification refers to a valid device."
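To double-check which devices the session actually sees, I think a quick check like the following should work (just a diagnostic sketch, not part of my training code):

    # Diagnostic sketch: list the devices visible through the cluster master.
    # When connected to server.target I would expect /job:ps/... and /job:worker/...
    # entries; a plain local session only shows /job:localhost/... devices.
    with tf.Session(server.target) as sess:
        for d in sess.list_devices():
            print(d.name)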
My code:
import sys

import tensorflow as tf
from tensorflow.python.client import timeline

# FLAGS (ps_hosts, worker_hosts, job_name, task_index, max_steps, log_dir)
# and the TimeLiner helper are defined elsewhere in my script.

def train():
    tl = TimeLiner()
    # parse the parameter server hosts
    ps_hosts = FLAGS.ps_hosts.split(",")
    # parse the worker hosts
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts,
                                    "worker": worker_hosts})
    graph_options = tf.GraphOptions(enable_bfloat16_sendrecv=True)
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3, allow_growth=True)
    config = tf.ConfigProto(graph_options=graph_options, gpu_options=gpu_options,
                            log_device_placement=False, allow_soft_placement=False)
    # start a server for this task
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             config=config)
    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        with tf.variable_scope(tf.get_variable_scope()):
            with tf.device(tf.train.replica_device_setter(
                    ps_device="/job:localhost/replica:0/task:%d/device:CPU:0" % FLAGS.task_index,
                    worker_device="/job:localhost/replica:0/task:%d/device:GPU:0" % FLAGS.task_index,
                    cluster=cluster)):
                loss = ...
                global_step = tf.train.get_or_create_global_step()
                train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)
        sys.stdout.flush()
        init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
        summary_op = tf.summary.merge_all()
        hooks = [tf.train.StopAtStepHook(last_step=FLAGS.max_steps)]
        total_training = 0
        graph_options = tf.GraphOptions(enable_bfloat16_sendrecv=True)
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9, allow_growth=True)
        config = tf.ConfigProto(graph_options=graph_options, gpu_options=gpu_options,
                                log_device_placement=False, allow_soft_placement=True)
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=(FLAGS.task_index == 0),
                                               checkpoint_dir=FLAGS.log_dir,
                                               log_step_count_steps=100000,
                                               hooks=hooks,
                                               config=config) as mon_sess:
            mon_sess.run(init_op)
            options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
            run_metadata = tf.RunMetadata()
            while not mon_sess.should_stop():
                # run a training step asynchronously and trace it
                [_, tot_loss, step, summary] = mon_sess.run([train_op, loss, global_step, summary_op],
                                                            options=options,
                                                            run_metadata=run_metadata)
                fetched_timeline = timeline.Timeline(run_metadata.step_stats)
                chrome_trace = fetched_timeline.generate_chrome_trace_format()
                tl.update_timeline(chrome_trace)
        tl.save('timeline.json')
Thanks in advance!
Answer 0 (score: 0)
You may have heard of setting the device with tf.device("/cpu:0") or something similar before you start the session. Have you tried that?
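For example, something along these lines when building the graph (just a sketch of what I mean):

    # Sketch: explicitly pin a variable onto the local CPU before the session starts.
    with tf.device("/cpu:0"):
        w = tf.get_variable("w", shape=[10, 10])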
Answer 1 (score: 0)
Could it be because you passed the worker's task index into the ps_device of replica_device_setter?
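In other words, something like this instead of the /job:localhost/... strings in your code (a sketch, assuming your ClusterSpec jobs are named "ps" and "worker"):

    # Sketch of the change I mean: refer to the cluster jobs, not to /job:localhost,
    # and use the worker's task index only for the worker_device.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        loss = ...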