On distributed TensorFlow (1.0.1), the chief worker hangs at the end of training when using SyncReplicasOptimizer and MonitoredTrainingSession

Time: 2017-05-02 21:04:12

Tags: tensorflow

On distributed TensorFlow (1.0.1), when using SyncReplicasOptimizer with a MonitoredTrainingSession, the chief worker hangs at the end of training.

I need help understanding what I am missing. Please let me know if you need more information.

Thanks in advance.

ClusterConfig:

Number of PS: 2, Number of workers: 2
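For context, the cluster and server built in the code excerpt further down are presumably assembled from host flags along these lines. This is a minimal sketch only: the flag names ps_hosts/ps_spec and the host addresses are assumptions for illustration (only FLAGS.worker_hosts and worker_spec actually appear in the question's code).

import tensorflow as tf

# Hypothetical flag definitions for a 2-PS / 2-worker cluster; host names are placeholders.
flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "ps0:2222,ps1:2222", "Comma-separated PS host:port pairs")
flags.DEFINE_string("worker_hosts", "worker0:2222,worker1:2222", "Comma-separated worker host:port pairs")
flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

ps_spec = FLAGS.ps_hosts.split(",")          # 2 parameter servers
worker_spec = FLAGS.worker_hosts.split(",")  # 2 workers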

Output:

WORKER_0:

INFO:train_opt:Sync Replica Optimizer Enabled...  
INFO:train_opt:[1] Training begins @ 1493747578.942078  
INFO:train_opt:[1] worker/0 1493747581.577683: training step 0 done with Loss 3476.279060  
INFO:train_opt:[1] worker/0 1493747584.819320: training step 200 done with Loss 220.282581  
INFO:train_opt:[1] worker/0 1493747587.935895: training step 400 done with Loss 38.253779  
INFO:train_opt:[1] worker/0 1493747590.975302: training step 600 done with Loss 20.162405  <=== Hangs by end of training  

WORKER_1:

INFO:train_opt:Using Train Optimizer: Adam  
INFO:train_opt:Sync Replica Optimizer Enabled...  
INFO:train_opt:[1] Training begins @ 1493747578.956051  
INFO:train_opt:[1] worker/1 1493747581.531765: training step 0 done with Loss 3476.279060  
INFO:train_opt:[1] worker/1 1493747585.027504: training step 200 done with Loss 196.834690  
INFO:train_opt:[1] worker/1 1493747588.469242: training step 400 done with Loss 31.045701  
INFO:train_opt:[1] worker/1 1493747591.898919: training step 600 done with Loss 16.355974  
INFO:train_opt:[1] Training ends @ 1493747612.044738  
INFO:train_opt:[1] Training elapsed time: 33.088687 s  
INFO:train_opt:FINAL Training Loss:11.364212  <==== Training completed on this worker!!  

cluster = tf.train.ClusterSpec({ "ps": ps_spec, "worker": worker_spec})

server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index, protocol="grpc")

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False, device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.task_index])

if FLAGS.job_name == "ps":
    server.join()

elif FLAGS.job_name == "worker":
    is_chief = (FLAGS.task_index == 0)
    # Number of workers, needed both for the GPU check below and for SyncReplicasOptimizer.
    num_workers = len(worker_spec)
    if FLAGS.num_gpus > 0:
        if FLAGS.num_gpus < num_workers:
            raise ValueError("number of gpus is less than number of workers")
        # Avoid gpu allocation conflict: now allocate task_num -> #gpu
        # for each worker in the corresponding machine
        gpu = (FLAGS.task_index % FLAGS.num_gpus)
        worker_device = "/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu)
    elif FLAGS.num_gpus == 0:
        # Just allocate the CPU to worker server
        cpu = 0
        worker_device = "/job:worker/task:%d/cpu:%d" % (FLAGS.task_index, cpu)

    # The device setter will automatically place Variables ops on separate
    # parameter servers (ps). The non-Variable ops will be placed on the workers.
    # The ps use CPU and workers use corresponding GPU
    with tf.device( tf.train.replica_device_setter(worker_device=worker_device, ps_device="/job:ps/cpu:0", cluster=cluster)):

        # ...build regressor model
        loss = ...
        # global_step is required by opt.minimize() and by StopAtStepHook below.
        global_step = tf.contrib.framework.get_or_create_global_step()
        opt = tf.train.AdamOptimizer(learning_rate=0.01)

        # Between-graph replication. If enabled, training happens *synchronously*.
        if FLAGS.sync_replicas == True:

            if FLAGS.replicas_to_aggregate is None:
                replicas_to_aggregate = num_workers
            else:
                replicas_to_aggregate = FLAGS.replicas_to_aggregate

            opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=replicas_to_aggregate, total_num_replicas=num_workers, name="nn_sync_replicas")
        train_step = opt.minimize(loss, global_step=global_step)

        if FLAGS.sync_replicas == True:
            # You can create the hook which handles initialization and queues.
            sync_replicas_hook = opt.make_session_run_hook(is_chief=is_chief, num_tokens=num_workers)

    if FLAGS.sync_replicas == True:
        hooks = [sync_replicas_hook, tf.train.StopAtStepHook(last_step=1000)]
    else:
        hooks = [tf.train.StopAtStepHook(last_step=1000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=is_chief, hooks=hooks, config=sess_config) as sess:
        while not sess.should_stop():
            # Run one synchronous training step and fetch the current loss value.
            _, loss_value = sess.run([train_step, loss], feed_dict={self.input_features: X_train.transpose(), self.target_output: Y_train})

1 Answer:

Answer 0 (score: 1)

Only the chief updates the variables, through the chief queue runner, but it should use the gradients from all available workers. The chief waits until enough gradients have been collected, which is not necessarily gradients from every worker.
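For reference, here is a rough sketch (assuming the TF 1.0.x SyncReplicasOptimizer API) of the two chief-side pieces that the sync_replicas_hook in the question wires up, i.e. where the chief blocks waiting for aggregated gradients:

# Roughly what opt.make_session_run_hook(is_chief=True, ...) arranges on the chief:
chief_queue_runner = opt.get_chief_queue_runner()                 # applies the aggregated gradients
init_tokens_op = opt.get_init_tokens_op(num_tokens=num_workers)   # seeds the token queue for the workers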

With replicas_to_aggregate = num_workers, the chief waits for gradients from all of the workers.

In your case, once training finishes on worker_1, worker_0 (the chief) hangs waiting for gradients from worker_1.

You can work around this by setting replicas_to_aggregate = 1. However, I am not sure whether that will still aggregate gradients from all of the workers while they are all running.
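As a sketch of that workaround (with the caveat above about what gets aggregated while all workers are still running), the SyncReplicasOptimizer call from the question would become:

opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=1,          # was: num_workers, which makes the chief wait on every worker
    total_num_replicas=num_workers,
    name="nn_sync_replicas")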