Device placement in distributed TensorFlow

Time: 2018-06-09 08:26:41

Tags: tensorflow distributed

I implemented between-graph replication and asynchronous training following the example at https://www.tensorflow.org/deploy/distributed.

Then I launched two parameter servers and one worker as follows.

python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=ps --task_index=0

python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=ps --task_index=1

python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=worker --task_index=0
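(The flag definitions in dnn.py are not shown in this post; below is a minimal sketch of how they could be declared with the TF 1.x tf.app.flags API. The defaults are placeholders.)

# Minimal sketch of the flag declarations assumed by the commands above
# (placeholder defaults; the real dnn.py definitions are not shown in this post).
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated list of ps host:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated list of worker host:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of this task within its job")
tf.app.flags.DEFINE_integer("max_steps", 100000, "Stop training after this many global steps")
tf.app.flags.DEFINE_string("log_dir", "/tmp/train_logs", "Directory for checkpoints and summaries")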

I have three questions about distributed TensorFlow.

First, according to the TensorFlow timeline of my program shown below, all of the computation and variable-update ops are executed on the ps node while the worker node sits idle. This confuses me, because I thought the computation ops should be executed on the worker node rather than on the ps node. Could someone help me with this?

[Screenshot: distributed TensorFlow timeline]

Second, in my program tf.train.replica_device_setter is used to assign only CPU devices to the servers. However, ops are executed on both the CPU and the GPUs. What is the correct way to assign CPUs/GPUs to the servers?
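One workaround I am aware of (an assumption on my part, not something from the tutorial) is to hide the GPUs from the ps process before any TensorFlow devices are created, for example:

import os

# Workaround sketch (my assumption, not from the tutorial): if this process is a
# ps task, hide the GPUs from it before the tf.train.Server is created, so the
# ps only ever registers a CPU device.
if FLAGS.job_name == "ps":
    os.environ["CUDA_VISIBLE_DEVICES"] = ""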

Last but not least, if I launch two servers and three workers, will the two servers keep identical copies of the parameters? Also, I wonder whether the three workers update the gradients of the same graph. Could someone tell me?

P.S. I assigned devices using tf.train.replica_device_setter. However, in the example (https://www.tensorflow.org/deploy/distributed), no device is assigned to the local server. In my case, if I do not assign devices to the local server, I get an error like this:

"Operation was explicitly assigned to /job:ps/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ... ]. Make sure the device specification refers to a valid device."
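For comparison, the tutorial builds the graph with cluster-relative job names instead of /job:localhost; a sketch in that style (not my current code) would be:

# Sketch in the tutorial's style (not my current code): variables are placed on
# the "ps" job and the remaining ops on this worker's task.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
    loss = ...  # model definition elided, as in my code below
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)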

My code:

import sys

import tensorflow as tf
from tensorflow.python.client import timeline  # chrome trace generation

# Note: FLAGS (the command-line flags) and TimeLiner (a helper that merges chrome
# traces across steps) are defined elsewhere in dnn.py and are not shown here.


def train():
    tl = TimeLiner()
    # get the current parameter servers
    ps_hosts = FLAGS.ps_hosts.split(",")
    # get the current workers
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts,
                                    "worker": worker_hosts})
    graph_options = tf.GraphOptions(enable_bfloat16_sendrecv=True)
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3, allow_growth=True)
    config = tf.ConfigProto(graph_options=graph_options, gpu_options=gpu_options,
                            log_device_placement=False, allow_soft_placement=False)
    # start a server for this task
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             config=config)
    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        with tf.variable_scope(tf.get_variable_scope()):
            with tf.device(tf.train.replica_device_setter(
                    ps_device="/job:localhost/replica:0/task:%d/device:CPU:0" % FLAGS.task_index,
                    worker_device="/job:localhost/replica:0/task:%d/device:GPU:0" % FLAGS.task_index,
                    cluster=cluster)):
                loss = ...
                global_step = tf.train.get_or_create_global_step()
                train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)
                sys.stdout.flush()
                init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
                summary_op = tf.summary.merge_all()
            hooks = [tf.train.StopAtStepHook(last_step=FLAGS.max_steps)]
            total_training = 0
            graph_options = tf.GraphOptions(enable_bfloat16_sendrecv=True)
            gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9, allow_growth=True)
            config = tf.ConfigProto(graph_options=graph_options, gpu_options=gpu_options,
                                    log_device_placement=False, allow_soft_placement=True)
            with tf.train.MonitoredTrainingSession(master=server.target,
                                                   is_chief=(FLAGS.task_index == 0),
                                                   checkpoint_dir=FLAGS.log_dir,
                                                   log_step_count_steps=100000,
                                                   hooks=hooks,
                                                   config=config) as mon_sess:
                mon_sess.run(init_op)
                # trace every step so a chrome timeline can be built from the step stats
                options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
                run_metadata = tf.RunMetadata()

                while not mon_sess.should_stop():
                    # run a training step asynchronously
                    [_, tot_loss, step, summary] = mon_sess.run([train_op, loss, global_step, summary_op],
                                                                options=options,
                                                                run_metadata=run_metadata)

                    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
                    chrome_trace = fetched_timeline.generate_chrome_trace_format()
                    tl.update_timeline(chrome_trace)
            tl.save('timeline.json')

Thanks in advance!

2 answers:

Answer 0 (score: 0)

You may have heard of setting the device with tf.device("/cpu:0") or something similar before starting the session. Have you tried that?
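For example, something along these lines (just a sketch, adapt it to your graph):

# Sketch: pin these variables to the CPU explicitly, before any session is started.
# The variable names and shapes are placeholders.
with tf.device("/cpu:0"):
    w = tf.get_variable("w", shape=[784, 10])
    b = tf.get_variable("b", shape=[10])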

Answer 1 (score: 0)

Could it be because you passed the worker's flag index as the index of the ps task in replica_device_setter?
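That is, instead of building ps_device from the worker's task_index, I would expect the ps job to be addressed by name and let replica_device_setter spread the variables across the ps tasks, roughly like this (a sketch, not tested against your code):

# Sketch (untested against your code): address the jobs by their cluster names and
# let replica_device_setter round-robin the variables over the ps tasks itself.
with tf.device(tf.train.replica_device_setter(
        ps_device="/job:ps",
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
    ...  # build the model here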