Chief worker hangs at the end of training when using SyncReplicasOptimizer with MonitoredTrainingSession on distributed TensorFlow (1.0.1)
I need help understanding what I am missing. Please let me know if you need more information.
Thanks in advance.
Cluster config:
Number of PS: 2, Number of workers: 2
Output:
WORKER_0:
INFO:train_opt:Sync Replica Optimizer Enabled...
INFO:train_opt:[1] Training begins @ 1493747578.942078
INFO:train_opt:[1] worker/0 1493747581.577683: training step 0 done with Loss 3476.279060
INFO:train_opt:[1] worker/0 1493747584.819320: training step 200 done with Loss 220.282581
INFO:train_opt:[1] worker/0 1493747587.935895: training step 400 done with Loss 38.253779
INFO:train_opt:[1] worker/0 1493747590.975302: training step 600 done with Loss 20.162405 <=== Hangs at the end of training
WORKER_1:
INFO:train_opt:Using Train Optimizer: Adam
INFO:train_opt:Sync Replica Optimizer Enabled...
INFO:train_opt:[1] Training begins @ 1493747578.956051
INFO:train_opt:[1] worker/1 1493747581.531765: training step 0 done with Loss 3476.279060
INFO:train_opt:[1] worker/1 1493747585.027504: training step 200 done with Loss 196.834690
INFO:train_opt:[1] worker/1 1493747588.469242: training step 400 done with Loss 31.045701
INFO:train_opt:[1] worker/1 1493747591.898919: training step 600 done with Loss 16.355974
INFO:train_opt:[1] Training ends @ 1493747612.044738
INFO:train_opt:[1] Training elapsed time: 33.088687 s
INFO:train_opt:FINAL Training Loss:11.364212 <==== Training completed on this worker!!
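
The training code is below. For reference, it assumes command-line flags roughly like the following sketch; the flag names come from the FLAGS references in the code, while the default values here are only placeholders:

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "ps0:2222,ps1:2222",
                    "Comma-separated list of ps host:port pairs (placeholder default)")
flags.DEFINE_string("worker_hosts", "worker0:2222,worker1:2222",
                    "Comma-separated list of worker host:port pairs (placeholder default)")
flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
flags.DEFINE_integer("num_gpus", 0, "Number of GPUs on each worker machine")
flags.DEFINE_boolean("sync_replicas", True, "Use SyncReplicasOptimizer for synchronous training")
flags.DEFINE_integer("replicas_to_aggregate", None,
                     "Number of gradients to aggregate per step (defaults to the number of workers)")
FLAGS = flags.FLAGS
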
ps_spec = FLAGS.ps_hosts.split(",")          # host lists; not shown in the original snippet
worker_spec = FLAGS.worker_hosts.split(",")
num_workers = len(worker_spec)

cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index, protocol="grpc")
sess_config = tf.ConfigProto(allow_soft_placement=True,
                             log_device_placement=False,
                             device_filters=["/job:ps",
                                             "/job:worker/task:%d" % FLAGS.task_index])

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    is_chief = (FLAGS.task_index == 0)

    if FLAGS.num_gpus > 0:
        if FLAGS.num_gpus < num_workers:
            raise ValueError("number of gpus is less than number of workers")
        # Avoid GPU allocation conflicts: map task_index -> GPU id
        # for each worker on the corresponding machine.
        gpu = (FLAGS.task_index % FLAGS.num_gpus)
        worker_device = "/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu)
    elif FLAGS.num_gpus == 0:
        # Just allocate the CPU to the worker server.
        cpu = 0
        worker_device = "/job:worker/task:%d/cpu:%d" % (FLAGS.task_index, cpu)

    # The device setter automatically places Variable ops on the parameter
    # servers (ps); non-Variable ops are placed on the workers.
    # The ps tasks use CPU and the workers use their corresponding GPU.
    with tf.device(tf.train.replica_device_setter(worker_device=worker_device,
                                                  ps_device="/job:ps/cpu:0",
                                                  cluster=cluster)):
        # global_step was not shown in the original snippet; created here so the
        # optimizer and StopAtStepHook have one to track.
        global_step = tf.contrib.framework.get_or_create_global_step()

        # ...build regressor model
        loss = ...
        opt = tf.train.AdamOptimizer(learning_rate=0.01)

        # Between-graph replication. If enabled, training happens *synchronously*.
        if FLAGS.sync_replicas:
            if FLAGS.replicas_to_aggregate is None:
                replicas_to_aggregate = num_workers
            else:
                replicas_to_aggregate = FLAGS.replicas_to_aggregate
            opt = tf.train.SyncReplicasOptimizer(opt,
                                                 replicas_to_aggregate=replicas_to_aggregate,
                                                 total_num_replicas=num_workers,
                                                 name="nn_sync_replicas")

        train_step = opt.minimize(loss, global_step=global_step)

        if FLAGS.sync_replicas:
            # This hook handles initialization and the sync token queues.
            sync_replicas_hook = opt.make_session_run_hook(is_chief=is_chief,
                                                           num_tokens=num_workers)
            hooks = [sync_replicas_hook, tf.train.StopAtStepHook(last_step=1000)]
        else:
            hooks = [tf.train.StopAtStepHook(last_step=1000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when
    # done or when an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=is_chief,
                                           hooks=hooks, config=sess_config) as sess:
        while not sess.should_stop():
            # Run one distributed training step and fetch the loss.
            _, loss_value = sess.run([train_step, loss],
                                     feed_dict={self.input_features: X_train.transpose(),
                                                self.target_output: Y_train})
Answer 0 (score: 1):
Only the chief updates the variables (via the chief queue runner), but it is supposed to use gradients from whichever workers are available: the chief waits until it has collected enough gradients, not necessarily gradients from all of the workers.
With replicas_to_aggregate = num_workers, the chief waits for gradients from every worker. In your case, once training finishes on worker_1, worker_0 (the chief) hangs waiting for gradients from worker_1 that will never arrive.
You can work around this by setting replicas_to_aggregate = 1. However, I am not sure whether that still aggregates gradients from all of the workers while they are all running.
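
A minimal sketch of that workaround, assuming the same FLAGS-driven setup as in the question (only the replicas_to_aggregate argument changes; everything else stays as in your code):

# Workaround sketch: aggregate a single gradient per step so the chief never
# blocks waiting on a worker that has already stopped. This trades away the
# "wait for all replicas" behaviour of fully synchronous training.
opt = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(learning_rate=0.01),
    replicas_to_aggregate=1,          # chief applies an update as soon as one gradient arrives
    total_num_replicas=num_workers,   # still the true number of workers
    name="nn_sync_replicas")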