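I am running asynchronous distributed training in TensorFlow using the parameter server strategy, with multiple workers on multiple CPUs and the evaluator as a separate node.
Below is an example TF_CONFIG for the parameter server. The TF_CONFIG on the chief, the workers, and the evaluator uses the same cluster definition; only the task type and index differ.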
TF_CONFIG={
  "task": {
    "type": "ps",
    "index": 0
  },
  "cluster": {
    "chief": ["machine2:2222"],
    "worker": ["machine3:2223", "machine4:2224"],
    "evaluator": ["machine5:2225"],
    "ps": ["machine1:2218"]
  }
}
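To make the per-node difference concrete, here is a minimal sketch of how each node's TF_CONFIG could be derived from the shared cluster definition. The names `CLUSTER` and `make_tf_config` are hypothetical helpers for illustration and are not part of TensorFlow; the host names and ports are copied from the config above.

```python
import json
import os

# Cluster layout copied from the TF_CONFIG shown above.
CLUSTER = {
    "chief": ["machine2:2222"],
    "worker": ["machine3:2223", "machine4:2224"],
    "evaluator": ["machine5:2225"],
    "ps": ["machine1:2218"],
}


def make_tf_config(task_type, task_index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one node (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(
            "index %d out of range for %r" % (task_index, task_type))
    # Every node shares the same cluster dict; only the task entry changes.
    return json.dumps({
        "task": {"type": task_type, "index": task_index},
        "cluster": cluster,
    })


# Example: the second worker (machine4) would set this before starting.
os.environ["TF_CONFIG"] = make_tf_config("worker", 1)
```

Each machine would call this with its own type and index (for example `make_tf_config("evaluator", 0)` on machine5) before launching the training script.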
def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')
  config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)

  train_and_eval_dict = model_lib.create_estimator_and_inputs(
      run_config=config,
      hparams=model_hparams.create_hparams(FLAGS.hparams_overrides),
      pipeline_config_path=FLAGS.pipeline_config_path,
      train_steps=FLAGS.num_train_steps,
      sample_1_of_n_eval_examples=FLAGS.sample_1_of_n_eval_examples,
      sample_1_of_n_eval_on_train_examples=(
          FLAGS.sample_1_of_n_eval_on_train_examples))
  estimator = train_and_eval_dict['estimator']
  train_input_fn = train_and_eval_dict['train_input_fn']
  eval_input_fns = train_and_eval_dict['eval_input_fns']
  eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
  predict_input_fn = train_and_eval_dict['predict_input_fn']
  train_steps = train_and_eval_dict['train_steps']

  if FLAGS.checkpoint_dir:
    if FLAGS.eval_training_data:
      name = 'training_data'
      input_fn = eval_on_train_input_fn
    else:
      name = 'validation_data'
      # The first eval input will be evaluated.
      input_fn = eval_input_fns[0]
    if FLAGS.run_once:
      estimator.evaluate(input_fn,
                         num_eval_steps=None,
                         checkpoint_path=tf.train.latest_checkpoint(
                             FLAGS.checkpoint_dir))
    else:
      model_lib.continuous_eval(estimator, FLAGS.checkpoint_dir, input_fn,
                                train_steps, name)
  else:
    train_spec, eval_specs = model_lib.create_train_and_eval_specs(
        train_input_fn,
        eval_input_fns,
        eval_on_train_input_fn,
        predict_input_fn,
        train_steps,
        eval_on_train_data=False)

    # Currently only a single Eval Spec is allowed.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])


if __name__ == '__main__':
  tf.app.run()
I am getting several warnings:
W0828 00:03:55.229441 140490069309248 estimator.py:1924] Estimator's model_fn (<function [...].model_fn at 0x7fc5da9b5268>) includes params argument, but params are not passed to Estimator.
./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node Preprocessor/ResizeToRange/strided_slice_3. Error: Pack node (Preprocessor/ResizeToRange/stack_2) axis attribute is out of bounds: 0
Despite these warnings, training runs fine and evaluation is performed, but my evaluation results are always 0.
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=1.66s).
Accumulating evaluation results...
DONE (t=0.52s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Any help would be greatly appreciated. Thanks in advance.