Question

I am using tf.estimator.train_and_evaluate with the ParameterServerStrategy distribution strategy in TensorFlow 1.11.0. According to the logs, the distribute coordinator runs in INDEPENDENT_WORKER mode.

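I split the whole training data set (tfrecords) into multiple partitions, one per training worker (including the chief). The problem is that when the data is unbalanced across workers, TensorFlow does not behave as I expect. I drew a picture to show the problem.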

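When using train_and_evaluate, training stops when either 1) max_steps is reached or 2) the dataset reaches its end. The evaluator stops when the global_step read from the checkpoint is >= TrainSpec.max_steps. In multi-worker distributed training, how should max_steps be set correctly?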

Code:

train_distribute = ParameterServerStrategy(num_gpus_per_worker=FLAGS.gpu_per_worker)
eval_distribute = MirroredStrategy(num_gpus_per_worker=FLAGS.gpu_per_worker,
                                   cross_tower_ops=cross_tower_ops_lib.AllReduceCrossTowerOps())
run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.checkpoint_secs,
    save_summary_steps=FLAGS.summary_steps,
    keep_checkpoint_max=FLAGS.max_checkpoints,
    train_distribute=train_distribute,
    eval_distribute=eval_distribute)

estimator, train_spec, eval_spec = create_estimator_and_specs(run_config=run_config)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

TF_CONFIG on each node:

chief node:
 {u'cluster': {
   u'ps': [u'ps:12615'], 
   u'chief': [u'chief:20396'], 
   u'worker': [u'worker1:18339', u'worker2:11609'], 
   u'evaluator': [u'evaluator:24352']}, 
   u'task': {u'index': 0, u'type': u'chief'}
 }

 worker 0:
 {u'cluster': {
   u'ps': [u'ps:12615'], 
   u'chief': [u'chief:20396'], 
   u'worker': [u'worker1:18339', u'worker2:11609'], 
   u'evaluator': [u'evaluator:24352']}, 
   u'task': {u'index': 0, u'type': u'worker'}
 }

 worker 1:
 {u'cluster': {
   u'ps': [u'ps:12615'], 
   u'chief': [u'chief:20396'], 
   u'worker': [u'worker1:18339', u'worker2:11609'],
   u'evaluator': [u'evaluator:24352']}, 
   u'task': {u'index': 1, u'type': u'worker'}
 }

evaluator:
{u'cluster': {
  u'ps': [u'ps:12615'], 
  u'chief': [u'chief:20396'], 
  u'worker': [u'worker1:18339', u'worker2:11609'],
  u'evaluator': [u'evaluator:24352']}, 
  u'task': {u'index': 0, u'type': u'evaluator'}
}
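
For reference, the four TF_CONFIG values above differ only in their `task` entry. A minimal sketch of generating them from one shared cluster dictionary follows. `CLUSTER` and `make_tf_config` are hypothetical names introduced for illustration, and the host:port values are copied from the listings above; the only TensorFlow-related assumption is that TF_CONFIG is read from the environment as a JSON string.

```python
import json
import os

# Cluster layout copied from the question; host:port values are the asker's.
CLUSTER = {
    "ps": ["ps:12615"],
    "chief": ["chief:20396"],
    "worker": ["worker1:18339", "worker2:11609"],
    "evaluator": ["evaluator:24352"],
}


def make_tf_config(task_type, task_index):
    """Return the TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in CLUSTER:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= task_index < len(CLUSTER[task_type]):
        raise ValueError("task index out of range: %d" % task_index)
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })


# Example: what worker 1 would export before starting its process.
os.environ["TF_CONFIG"] = make_tf_config("worker", 1)
```

Building every node's value from one dictionary keeps the `cluster` section identical across nodes, so only the `task` type and index can differ.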

Estimator API: training data is unbalanced across workers. The chief can finish first
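
To make the concern concrete, here is a toy model of the stop conditions exactly as the question states them. It is plain Python, not TensorFlow code, and it is not a description of how the Estimator is implemented. The function name `simulate`, the round-robin ordering of workers, and the rule that a partial final batch counts as one step are all assumptions made for illustration; real asynchronous workers interleave in no fixed order.

```python
import math


def simulate(partition_sizes, batch_size, max_steps):
    """Toy round-robin model of workers sharing one global_step.

    partition_sizes: examples held by each worker (index 0 is the chief).
    Returns (final_global_step, steps_per_worker, evaluator_would_stop).
    """
    # Batches each worker can produce before its dataset runs out
    # (assumption: a partial final batch still counts as one step).
    remaining = [math.ceil(n / batch_size) for n in partition_sizes]
    steps_done = [0] * len(partition_sizes)
    global_step = 0

    # Each worker takes one step per round while it still has data and
    # the shared global_step is below max_steps.
    while global_step < max_steps and any(remaining):
        for w in range(len(remaining)):
            if global_step >= max_steps:
                break
            if remaining[w] > 0:
                remaining[w] -= 1
                steps_done[w] += 1
                global_step += 1

    # Evaluator stop condition as stated in the question.
    evaluator_would_stop = global_step >= max_steps
    return global_step, steps_done, evaluator_would_stop


print(simulate([100, 400, 400], batch_size=100, max_steps=12))
# → (9, [1, 4, 4], False)
print(simulate([100, 400, 400], batch_size=100, max_steps=9))
# → (9, [1, 4, 4], True)
```

In this model the chief, holding the smallest partition, runs out of data after a single step. With only 9 steps available in total, max_steps=12 is never reached and the evaluator condition stays false, while max_steps=9 satisfies it.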

0 answers: