TensorFlow distributed training does not evaluate the model correctly

Time: 2019-08-28 10:37:57

Tags: python tensorflow tensorflow-estimator distributed-tensorflow

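I am running asynchronous distributed training in TensorFlow using the parameter server strategy.

The setup uses multiple workers on multiple CPUs, with the evaluator running as a separate node.

Below is an example TF_CONFIG for the parameter server. The TF_CONFIG values for the chief, the workers, and the evaluator differ only in the task type and index.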

TF_CONFIG={
  "task": {
    "type": "ps",
    "index": 0
  },
  "cluster": {
    "chief": ["machine2:2222"],
    "worker": ["machine3:2223", "machine4:2224"],
    "evaluator": ["machine5:2225"],
    "ps": ["machine1:2218"]
  }
}
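Since only the "task" entry changes from node to node, the per-node values can be generated from one shared cluster dictionary. The following is a minimal sketch under that assumption; `make_tf_config` is a hypothetical helper name, and the host names are the ones from the example above. Note that the `tf.estimator.train_and_evaluate` documentation describes the evaluator as a task that is not part of the training cluster; this sketch keeps the layout from the question and does not settle whether that difference matters here.

```python
import json

# Cluster layout copied from the question.
CLUSTER = {
    "chief": ["machine2:2222"],
    "worker": ["machine3:2223", "machine4:2224"],
    "evaluator": ["machine5:2225"],
    "ps": ["machine1:2218"],
}


def make_tf_config(task_type, index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % (task_type,))
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError("index %d out of range for %r" % (index, task_type))
    return json.dumps({
        "task": {"type": task_type, "index": index},
        "cluster": cluster,
    })


if __name__ == "__main__":
    # Print one TF_CONFIG value per node so each can be exported on its host.
    for task_type, hosts in CLUSTER.items():
        for index, host in enumerate(hosts):
            print(host, make_tf_config(task_type, index))
```

Each printed string would be exported as the TF_CONFIG environment variable on the matching machine before the training script starts.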

# model_lib and model_hparams are assumed to come from the
# TensorFlow Object Detection API.
from absl import flags
import tensorflow as tf

from object_detection import model_hparams
from object_detection import model_lib

# The flag definitions (model_dir, pipeline_config_path, ...) are not
# shown in the question.
FLAGS = flags.FLAGS


def main(unused_argv):
    flags.mark_flag_as_required('model_dir')
    flags.mark_flag_as_required('pipeline_config_path')
    config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)

    train_and_eval_dict = model_lib.create_estimator_and_inputs(
        run_config=config,
        hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
        pipeline_config_path=FLAGS.pipeline_config_path,
        train_steps=FLAGS.num_train_steps,
        sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
        sample_1_of_n_eval_on_train_examples=(
            FLAGS.sample_1_of_n_eval_on_train_examples))
    estimator = train_and_eval_dict['estimator']
    train_input_fn = train_and_eval_dict['train_input_fn']
    eval_input_fns = train_and_eval_dict['eval_input_fns']
    eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
    predict_input_fn = train_and_eval_dict['predict_input_fn']
    train_steps = train_and_eval_dict['train_steps']

    if FLAGS.checkpoint_dir:
        # Evaluation-only mode: evaluate checkpoints found in checkpoint_dir.
        if FLAGS.eval_training_data:
            name = 'training_data'
            input_fn = eval_on_train_input_fn
        else:
            name = 'validation_data'
            # The first eval input will be evaluated.
            input_fn = eval_input_fns[0]

        if FLAGS.run_once:
            # Estimator.evaluate takes `steps`, not `num_eval_steps`.
            estimator.evaluate(input_fn,
                               steps=None,
                               checkpoint_path=tf.train.latest_checkpoint(
                                   FLAGS.checkpoint_dir))
        else:
            model_lib.continuous_eval(estimator, FLAGS.checkpoint_dir,
                                      input_fn, train_steps, name)
    else:
        # Training mode: each node takes its role from TF_CONFIG.
        train_spec, eval_specs = model_lib.create_train_and_eval_specs(
            train_input_fn,
            eval_input_fns,
            eval_on_train_input_fn,
            predict_input_fn,
            train_steps,
            eval_on_train_data=False)

        # Currently only a single Eval Spec is allowed.
        tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])


if __name__ == '__main__':
    tf.app.run()

I get several warnings:

  

W0828 00:03:55.229441 140490069309248 estimator.py:1924] Estimator's model_fn (….model_fn at 0x7fc5da9b5268>) includes params argument, but params are not passed to Estimator.

./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node Preprocessor/ResizeToRange/strided_slice_3. Error: Pack node (Preprocessor/ResizeToRange/stack_2) axis attribute is out of bounds: 0


Training itself runs fine and evaluation is performed, but every evaluation metric is always 0:

  

creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=1.66s).
Accumulating evaluation results...
DONE (t=0.52s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
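For reading the summary above: every AP and AR line is computed at an intersection-over-union (IoU) threshold, and a detection only counts as a match when its IoU with a ground-truth box reaches that threshold. The sketch below shows the IoU computation for two axis-aligned boxes. It is illustrative only: the `iou` function name and the (ymin, xmin, ymax, xmax) box layout are assumptions, and this is not the code the COCO evaluator runs.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (ymin, xmin, ymax, xmax).

    The box layout is an assumption made for this sketch.
    """
    # Corners of the overlapping rectangle, if there is one.
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Degenerate boxes with zero total area are treated as non-overlapping.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Read this way, 0.000 on every line suggests that essentially no detection reached an IoU of 0.50 with any ground-truth box, rather than that localization was only slightly off.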

Any help would be greatly appreciated. Thanks in advance.

0 answers:

No answers yet.