I'm using tf.estimator.train_and_evaluate with the ParameterServerStrategy distribution strategy in TensorFlow 1.11.0. According to the logs, the distribute coordinator runs in INDEPENDENT_WORKER mode.
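I split the whole training data set (tfrecords) into multiple partitions, one for each training worker (including the chief). The problem is that when the data is unbalanced across workers, TensorFlow does not behave as I expect. I drew a diagram to show the problem.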
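For reference, the partitioning described above can be sketched in plain Python. This is only an illustration under assumptions: the helper name files_for_task, the file names, and the convention that the chief reads partition 0 while worker i reads partition i + 1 are all hypothetical, not something TensorFlow prescribes.

```python
def files_for_task(all_files, tf_config):
    """Return the tfrecord files assigned to the current training task.

    Assumed convention (hypothetical): the chief reads partition 0 and
    worker i reads partition i + 1. Files are dealt out round-robin.
    """
    cluster = tf_config["cluster"]
    task = tf_config["task"]
    num_chiefs = len(cluster.get("chief", []))
    num_partitions = num_chiefs + len(cluster.get("worker", []))
    if task["type"] == "chief":
        partition = 0
    elif task["type"] == "worker":
        partition = task["index"] + num_chiefs
    else:
        # ps and evaluator tasks do not read training data in this sketch.
        return []
    return sorted(all_files)[partition::num_partitions]


# Same cluster layout as in the question: one chief and two workers.
example_cluster = {
    "ps": ["ps:12615"],
    "chief": ["chief:20396"],
    "worker": ["worker1:18339", "worker2:11609"],
    "evaluator": ["evaluator:24352"],
}
# Hypothetical file names; 7 files cannot be split evenly over 3 tasks.
example_files = ["part-%05d.tfrecord" % i for i in range(7)]

for task_type, task_index in [("chief", 0), ("worker", 0), ("worker", 1)]:
    config = {"cluster": example_cluster,
              "task": {"type": task_type, "index": task_index}}
    print(task_type, task_index, files_for_task(example_files, config))
```

With seven files and three training tasks, the chief ends up with one more file than each worker, which is the kind of imbalance I am asking about.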
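With train_and_evaluate, training stops when either 1) max_steps is reached or 2) the dataset reaches its end. The evaluator stops when the global_step read from the checkpoint is >= TrainSpec.max_steps. In multi-worker distributed training, how should I set max_steps correctly?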
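To make the interaction between these stop conditions concrete, here is a toy simulation in plain Python. It is a sketch under assumptions, not TensorFlow code: the function names are hypothetical, workers are assumed to advance in lockstep, and every worker step is assumed to increment the shared global_step by exactly one.

```python
def simulate_async_training(batches_per_worker, max_steps):
    """Toy model of asynchronous training with one shared global_step.

    batches_per_worker: number of batches each training task can read
        before its dataset is exhausted (one entry per task).
    max_steps: the value that would be passed as TrainSpec.max_steps.

    Returns (final_global_step, steps_done_per_worker).
    """
    remaining = list(batches_per_worker)
    done = [0] * len(remaining)
    global_step = 0
    # Round-robin stands in for workers that run at the same speed.
    while global_step < max_steps and any(r > 0 for r in remaining):
        for i in range(len(remaining)):
            if global_step >= max_steps:
                break  # stop condition 1: max_steps reached
            if remaining[i] == 0:
                continue  # stop condition 2: this worker's data ran out
            remaining[i] -= 1
            done[i] += 1
            global_step += 1
    return global_step, done


def evaluator_would_stop(final_global_step, max_steps):
    """Evaluator stop condition as described in the question."""
    return final_global_step >= max_steps


# Hypothetical unbalanced partitions: chief, worker 0, worker 1.
batches = [100, 60, 40]
for max_steps in (150, sum(batches), 300):
    final_step, done = simulate_async_training(batches, max_steps)
    print(max_steps, final_step, done,
          evaluator_would_stop(final_step, max_steps))
```

In this model the global step can never exceed the total number of batches across all partitions, so any max_steps above that total is never reached and the evaluator condition is never satisfied.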
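Code:

    # Imports assumed for TensorFlow 1.11 (not shown in my original snippet).
    import tensorflow as tf
    from tensorflow.contrib.distribute import MirroredStrategy, ParameterServerStrategy
    from tensorflow.contrib.distribute.python import cross_tower_ops as cross_tower_ops_lib

    # Training uses the parameter server strategy.
    train_distribute = ParameterServerStrategy(num_gpus_per_worker=FLAGS.gpu_per_worker)
    # Evaluation mirrors the model across the evaluator's local GPUs.
    eval_distribute = MirroredStrategy(
        num_gpus_per_worker=FLAGS.gpu_per_worker,
        cross_tower_ops=cross_tower_ops_lib.AllReduceCrossTowerOps())
    run_config = tf.estimator.RunConfig(
        model_dir=FLAGS.model_dir,
        save_checkpoints_secs=FLAGS.checkpoint_secs,
        save_summary_steps=FLAGS.summary_steps,
        keep_checkpoint_max=FLAGS.max_checkpoints,
        train_distribute=train_distribute,
        eval_distribute=eval_distribute)
    # create_estimator_and_specs is my own helper; it builds the TrainSpec and EvalSpec.
    estimator, train_spec, eval_spec = create_estimator_and_specs(run_config=run_config)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)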
TF_CONFIG on each node:
chief node:

    {u'cluster': {
        u'ps': [u'ps:12615'],
        u'chief': [u'chief:20396'],
        u'worker': [u'worker1:18339', u'worker2:11609'],
        u'evaluator': [u'evaluator:24352']},
     u'task': {u'index': 0, u'type': u'chief'}
    }

worker 0:

    {u'cluster': {
        u'ps': [u'ps:12615'],
        u'chief': [u'chief:20396'],
        u'worker': [u'worker1:18339', u'worker2:11609'],
        u'evaluator': [u'evaluator:24352']},
     u'task': {u'index': 0, u'type': u'worker'}
    }

worker 1:

    {u'cluster': {
        u'ps': [u'ps:12615'],
        u'chief': [u'chief:20396'],
        u'worker': [u'worker1:18339', u'worker2:11609'],
        u'evaluator': [u'evaluator:24352']},
     u'task': {u'index': 1, u'type': u'worker'}
    }

evaluator:

    {u'cluster': {
        u'ps': [u'ps:12615'],
        u'chief': [u'chief:20396'],
        u'worker': [u'worker1:18339', u'worker2:11609'],
        u'evaluator': [u'evaluator:24352']},
     u'task': {u'index': 0, u'type': u'evaluator'}
    }
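For anyone who wants to reproduce the setup, the four TF_CONFIG values above can be generated from a single cluster definition. This is a minimal sketch: the helper name make_tf_config is hypothetical, and it only builds the JSON string that would be placed in the TF_CONFIG environment variable.

```python
import json

# Cluster layout copied from the TF_CONFIG dumps above.
CLUSTER = {
    "ps": ["ps:12615"],
    "chief": ["chief:20396"],
    "worker": ["worker1:18339", "worker2:11609"],
    "evaluator": ["evaluator:24352"],
}


def make_tf_config(task_type, task_index):
    """Serialize the TF_CONFIG value for one task of the cluster above."""
    if task_type not in CLUSTER:
        raise ValueError("unknown task type %r" % task_type)
    if not 0 <= task_index < len(CLUSTER[task_type]):
        raise ValueError("no %s task with index %d" % (task_type, task_index))
    config = {"cluster": CLUSTER,
              "task": {"type": task_type, "index": task_index}}
    return json.dumps(config, sort_keys=True)


# Print one TF_CONFIG string per task in the cluster.
for task_type in ("chief", "worker", "ps", "evaluator"):
    for task_index in range(len(CLUSTER[task_type])):
        print(make_tf_config(task_type, task_index))
```

On each node the matching string would be exported as the TF_CONFIG environment variable before the RunConfig is created, for example os.environ["TF_CONFIG"] = make_tf_config("worker", 1) on the second worker.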