Estimator API: training data is unbalanced across workers, and the chief may finish first

Time: 2019-01-18 10:42:52

Tags: tensorflow

I am using tf.estimator.train_and_evaluate with the ParameterServerStrategy distribution strategy in TensorFlow 1.11.0. According to the logs, the distribute coordinator runs in INDEPENDENT_WORKER mode.

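I split the entire training data (tfrecords) into multiple partitions, one for each training worker (including the chief). The problem is that when the data is unbalanced across workers, TensorFlow does not behave as I expect. I drew a diagram to illustrate the problem.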
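
To make the imbalance concrete, here is a minimal sketch of how many steps each task can run before its own partition is exhausted. The partition sizes, the batch size, and the helper name steps_until_exhausted are hypothetical, and the sketch assumes every task trains at the same speed and reads its partition only once.

```python
def steps_until_exhausted(num_examples, batch_size):
    """Number of batches in one pass over a partition."""
    # Ceiling division: a final partial batch still counts as one step.
    return -(-num_examples // batch_size)

# Hypothetical partition sizes (examples per task), chosen only for illustration.
partition_sizes = {"chief": 1000, "worker0": 4000, "worker1": 2500}

steps = {name: steps_until_exhausted(n, 100) for name, n in partition_sizes.items()}

# The task with the fewest steps runs out of data first.
first_to_finish = min(steps, key=steps.get)
print(steps, first_to_finish)
```

With these made-up sizes the chief holds the smallest partition, so it is the first task to reach the end of its dataset.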

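When using train_and_evaluate, training stops when either 1) max_steps is reached, or 2) the dataset reaches its end. The evaluator stops when the global_step read from the checkpoint is >= TrainSpec.max_steps. In multi-worker distributed training, how should max_steps be set correctly?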
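
The interaction of the two stopping conditions can be sketched with a small pure-Python simulation. This is not TensorFlow code. It assumes asynchronous workers that share one global_step, take turns in round-robin order, and each add 1 to global_step per batch; simulate_training and the step counts are hypothetical.

```python
def simulate_training(steps_available, max_steps):
    """Round-robin simulation of workers sharing one global_step.

    steps_available[i] is the number of batches in worker i's partition.
    Returns (final_global_step, stop_reasons), with one reason per worker.
    """
    remaining = list(steps_available)
    reasons = [None] * len(remaining)
    global_step = 0
    while any(r is None for r in reasons):
        for i in range(len(remaining)):
            if reasons[i] is not None:
                continue  # this worker has already stopped
            if global_step >= max_steps:
                reasons[i] = "max_steps reached"
            elif remaining[i] == 0:
                reasons[i] = "dataset exhausted"
            else:
                remaining[i] -= 1
                global_step += 1
    return global_step, reasons

# Hypothetical batches per task, with the chief first: 10 + 40 + 25 = 75 in total.
too_large = simulate_training([10, 40, 25], max_steps=100)
reachable = simulate_training([10, 40, 25], max_steps=60)
print(too_large)
print(reachable)
```

Under these assumptions, a max_steps larger than the total number of batches across all partitions is never reached: every worker stops on dataset exhaustion and global_step ends below max_steps, so the evaluator's stop condition described above would not be met. A max_steps at or below that total is reached even though the chief runs out of data early.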

Code:

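# Imports assumed for the TF 1.11 contrib layout; the original snippet omitted them.
import tensorflow as tf
from tensorflow.contrib.distribute import MirroredStrategy, ParameterServerStrategy
from tensorflow.contrib.distribute.python import cross_tower_ops as cross_tower_ops_lib

# FLAGS and create_estimator_and_specs are defined elsewhere in the asker's program.
# Training uses parameter servers; evaluation mirrors variables across local GPUs.
train_distribute = ParameterServerStrategy(num_gpus_per_worker=FLAGS.gpu_per_worker)
eval_distribute = MirroredStrategy(num_gpus_per_worker=FLAGS.gpu_per_worker,
                                   cross_tower_ops=cross_tower_ops_lib.AllReduceCrossTowerOps())
run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.checkpoint_secs,
    save_summary_steps=FLAGS.summary_steps,
    keep_checkpoint_max=FLAGS.max_checkpoints,
    train_distribute=train_distribute,
    eval_distribute=eval_distribute)

estimator, train_spec, eval_spec = create_estimator_and_specs(run_config=run_config)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)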

TF_CONFIG on each node:

chief node:
 {u'cluster': {
   u'ps': [u'ps:12615'], 
   u'chief': [u'chief:20396'], 
   u'worker': [u'worker1:18339', u'worker2:11609'], 
   u'evaluator': [u'evaluator:24352']}, 
   u'task': {u'index': 0, u'type': u'chief'}
 }

 worker 0:
 {u'cluster': {
   u'ps': [u'ps:12615'], 
   u'chief': [u'chief:20396'], 
   u'worker': [u'worker1:18339', u'worker2:11609'], 
   u'evaluator': [u'evaluator:24352']}, 
   u'task': {u'index': 0, u'type': u'worker'}
 }

 worker 1:
 {u'cluster': {
   u'ps': [u'ps:12615'], 
   u'chief': [u'chief:20396'], 
   u'worker': [u'worker1:18339', u'worker2:11609'],
   u'evaluator': [u'evaluator:24352']}, 
   u'task': {u'index': 1, u'type': u'worker'}
 }

evaluator:
{u'cluster': {
  u'ps': [u'ps:12615'], 
  u'chief': [u'chief:20396'], 
  u'worker': [u'worker1:18339', u'worker2:11609'],
  u'evaluator': [u'evaluator:24352']}, 
  u'task': {u'index': 0, u'type': u'evaluator'}
}
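
As an illustration of how each task could select its own data partition from the TF_CONFIG values above, here is a minimal sketch. The mapping (chief reads partition 0, worker i reads partition i + 1) and the helper name partition_index are assumptions; the question does not say how partitions are actually assigned.

```python
import json

def partition_index(tf_config_json):
    """Return the training-partition index for this task, or None if it does not train."""
    config = json.loads(tf_config_json)
    task = config["task"]
    if task["type"] == "chief":
        return 0  # assumed: the chief reads the first partition
    if task["type"] == "worker":
        return task["index"] + 1  # assumed: workers follow the chief
    return None  # ps and evaluator tasks read no training partition

# Same cluster layout as in the question, shown here for worker 1.
example = json.dumps({
    "cluster": {
        "ps": ["ps:12615"],
        "chief": ["chief:20396"],
        "worker": ["worker1:18339", "worker2:11609"],
        "evaluator": ["evaluator:24352"],
    },
    "task": {"index": 1, "type": "worker"},
})
idx = partition_index(example)
print(idx)
```

In a real job the JSON string would come from the TF_CONFIG environment variable rather than from a literal.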

0 answers:

No answers yet