How do I avoid TensorBoard summaries being overwritten when running train_and_eval() in a distributed setting and saving multiple checkpoints?

Asked: 2018-10-27 12:31:01

Tags: python tensorflow tensorboard tensorflow-estimator

I've run into a problem saving summaries and evaluation metrics while training in a distributed environment: the summaries appear to be getting overwritten.

I've been trying to get more familiar with how summaries are saved when training and evaluating a model in a distributed fashion. There is clearly something I haven't sorted out yet, but as far as I understand it, only the master writes summaries to the specified model_dir. I think the problem I'm running into is that another checkpoint and evaluation run gets triggered while the master is still busy with the previous evaluation?
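
As I understand it, which replica acts as the master is determined by the TF_CONFIG environment variable that ml-engine sets on each machine; roughly something like this on the master (an illustrative sketch of the format only, hostnames and ports are made up):

import json
import os

# Illustrative TF_CONFIG as ml-engine would set it on the master
# (hostnames and ports are invented for this sketch):
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222", "ps-1:2222"],
    },
    "task": {"type": "master", "index": 0},
})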

I saw a mention that this might be related to the tf.estimator.RunConfig API docs, so I set keep_checkpoint_max to a higher value, but that made no difference.

When I submit a job to gcloud ml-engine with the following parameters (see the RunConfig sketch after this list):

  • 2 ps (n1-highmem-8)
  • 2 workers (n1-highmem-8)
  • keep_checkpoint_max=10
  • save_checkpoints_steps=50
  • save_summary_steps=20
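
For reference, these settings map onto tf.estimator.RunConfig roughly like this (a minimal sketch of the wiring rather than my exact code; the model_dir path is made up):

import tensorflow as tf

run_config = tf.estimator.RunConfig(
    model_dir="gs://my-bucket/model",  # hypothetical output path
    keep_checkpoint_max=10,      # retain the 10 most recent checkpoints
    save_checkpoints_steps=50,   # checkpoint (and so trigger an eval) every 50 steps
    save_summary_steps=20,       # write training summaries every 20 steps
)
# The estimator is then constructed with config=run_config.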

My train_and_eval() code:

def train_and_eval(self, feature_provider, steps):
    self._send_dist_data_to_comet(feature_provider, ["train", "test"])
    self._initialize_estimator(feature_provider)
    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: self.train_input_fn(
            tfrecords=feature_provider.train_tfrecords, params=self.params
        ),
        max_steps=steps,
    )
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: self.eval_input_fn(
            tfrecords=feature_provider.test_tfrecords, params=self.params
        ),
        steps=None,       # evaluate on the full test set
        throttle_secs=0,  # evaluate after every checkpoint, never skip
    )
    tf.estimator.train_and_evaluate(
        estimator=self._estimator,
        train_spec=train_spec,
        eval_spec=eval_spec,
    )

I set throttle_secs=0 to make sure no evaluations get skipped. Running with steps=400 (the max_steps passed to the TrainSpec), my TensorBoard summaries came out as:

[Image: resulting summaries from TensorBoard]

So I think it correctly starts logging training summaries every 20 steps (though perhaps the step-20 summary was overwritten, since the first one shows up at step 40), but after the first checkpoint essentially nothing gets logged until training finishes and the final checkpoint and evaluation run happen.

The code I use to log metrics to TensorBoard:

import tensorflow as tf

ModeKeys = tf.estimator.ModeKeys  # mode constants: TRAIN, EVAL, PREDICT

def scalars(model, features, labels, spec, params):
    # Each tf.metrics.* call returns a (value, update_op) pair.
    std_metrics = {
        "accuracy": tf.metrics.accuracy(
            labels=labels,
            predictions=spec.predictions["class_ids"],
            name="acc_op",
        ),
        "mpc_accuracy": tf.metrics.mean_per_class_accuracy(
            labels=labels,
            predictions=spec.predictions["class_ids"],
            num_classes=params["_n_out_classes"],
            name="mpc_acc_op",
        ),
        "auc": tf.metrics.auc(
            labels=tf.one_hot(indices=labels, depth=params["_n_out_classes"]),
            predictions=spec.predictions["probabilities"],
            name="auc_op",
        ),
    }
    if spec.mode == ModeKeys.EVAL:
        # Merge with any metrics the model_fn already defined;
        # EstimatorSpec is a namedtuple, so _replace returns a copy.
        eval_metrics = spec.eval_metric_ops or {}
        eval_metrics.update(std_metrics)
        spec = spec._replace(eval_metric_ops=eval_metrics)
    else:
        # TRAIN: write scalar summaries directly; [1] selects the
        # update_op, which also yields the updated metric value.
        tf.summary.scalar("loss", spec.loss)
        tf.summary.scalar("accuracy", std_metrics["accuracy"][1])
        tf.summary.scalar("auc", std_metrics["auc"][1])
    return spec

I'm using several models in this project, so I use the function above together with a decorator to inject this code into the different model_fns.

The decorator looks like this (I'm not sure how much of the code below is relevant, but I'm including it for completeness):

from functools import partial, wraps

def attach(addons, modes=None, order="POST"):
    def decorator(model_fn):
        @wraps(model_fn)
        def wrapper(self, features, labels, mode, params):
            # Dict so I can enable/disable particular addons by name;
            # in this case the addon would be 'scalars'.
            aux = self.aux_config
            targets = modes or ["train", "eval", "predict"]
            target_modes = [
                {
                    "train": ModeKeys.TRAIN,
                    "eval": ModeKeys.EVAL,
                    "predict": ModeKeys.PREDICT,
                }.get(trg_mode.lower())
                for trg_mode in targets
            ]
            # Addons can be attached for specific modes only; here I
            # attach 'scalars' for TRAIN and EVAL.
            applicable_addons = [
                addon
                for addon in addons
                if mode in target_modes and aux.get(addon.__name__, True)
            ]
            if order == "PRE":
                # Run the addon code before the model_fn.
                for add_on in applicable_addons:
                    add_on(self, features, labels, mode, params)
                spec = model_fn(self, features, labels, mode, params)
            elif order == "POST":
                # Run the model_fn first, then pass the resulting spec
                # through each addon (e.g. the 'scalars' fn above).
                spec = model_fn(self, features, labels, mode, params)
                for add_on in applicable_addons:
                    spec = add_on(self, features, labels, spec, params)
            return spec

        return wrapper

    return decorator

# A couple of partials based on where the attached functions run,
# purely to make the call sites more legible:
prepare = partial(attach, order="PRE")  # runs before the model_fn, i.e. 'preparing' something
addon = partial(attach, order="POST")   # runs after, i.e. an 'addon' to the model_fn

Finally, the _model_fn() on my abstract base model class, which returns the specific model's actual model_fn with all the addons attached, looks like this:

@addon([scalars], ["TRAIN", "EVAL"])
def _model_fn(self, features, labels, mode, params):
    return self.model_fn(features, labels, mode, params)
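
For context, _initialize_estimator() then constructs the estimator from this wrapped model_fn, roughly along these lines (a simplified sketch of the wiring, not the literal code from my project):

# Simplified sketch: the decorated _model_fn is handed to the Estimator.
self._estimator = tf.estimator.Estimator(
    model_fn=self._model_fn,  # the wrapped function shown above
    params=self.params,
    config=run_config,        # the RunConfig sketched earlier
)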

All of the above works fine when I run the model locally. I'm not sure this is even something I can fix myself, but maybe there is some kind of hook, or some way to make the workers wait for the master to finish checkpointing before continuing?
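
For example, I imagined something along the lines of a tf.train.SessionRunHook attached via TrainSpec(hooks=...). This is purely an illustrative skeleton of the kind of mechanism I mean, not something I have working; the actual logic for waiting on the master is exactly the part I'm missing:

import tensorflow as tf

class WaitForMasterHook(tf.train.SessionRunHook):
    """Illustrative skeleton only: where would the waiting logic go?"""

    def begin(self):
        # Called once when the graph is finalized.
        self._global_step = tf.train.get_or_create_global_step()

    def before_run(self, run_context):
        # Ask the session to also return the global step.
        return tf.train.SessionRunArgs(self._global_step)

    def after_run(self, run_context, run_values):
        step = run_values.results
        # Hypothetically, block here until the master signals that
        # checkpointing/evaluation for `step` has finished.
        tf.logging.info("worker at global step %d", step)

# Hooks can be passed on the training side of train_and_evaluate:
# train_spec = tf.estimator.TrainSpec(
#     input_fn=..., max_steps=400, hooks=[WaitForMasterHook()]
# )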

Any ideas would be greatly appreciated; if I've missed anything relevant to the problem, I'm happy to share other parts of the code.

Thanks in advance for your help.

0 Answers
