I've run into a problem with how summaries and evaluation metrics get saved when training in a distributed setting.
I've been trying to get more familiar with how summaries are saved when training and evaluating models in a distributed fashion. There's clearly something I haven't got right yet, but as I understand it, only the chief writes summaries to the specified model_dir. I think the problem I'm seeing is that another checkpoint and evaluation run gets triggered while the chief is still busy with the previous evaluation?
I saw someone mention this might be related to the tf.estimator.RunConfig API docs, so I set keep_checkpoint_max to a higher value, but that made no difference.
When I submit the job to gcloud ml-engine with the following parameters:
keep_checkpoint_max=10
save_checkpoint_steps=50
save_summary_steps=20
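For completeness, this is roughly how I'd expect those three values to land on the estimator's RunConfig (a simplified sketch rather than my exact wiring; how the ml-engine job arguments get parsed is omitted, and note the RunConfig argument is spelled save_checkpoints_steps):

import tensorflow as tf

# Sketch only: assumes the three hyperparameters above have already been
# parsed from the job arguments into plain Python values.
run_config = tf.estimator.RunConfig(
    keep_checkpoint_max=10,
    save_checkpoints_steps=50,
    save_summary_steps=20,
)
estimator = tf.estimator.Estimator(
    model_fn=model_fn, params=params, config=run_config
)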
My train_and_eval() code:
def train_and_eval(self, feature_provider, steps):
    self._send_dist_data_to_comet(feature_provider, ["train", "test"])
    self._initialize_estimator(feature_provider)
    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: self.train_input_fn(
            tfrecords=feature_provider.train_tfrecords, params=self.params
        ),
        max_steps=steps,
    )
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: self.eval_input_fn(
            tfrecords=feature_provider.test_tfrecords, params=self.params
        ),
        steps=None,
        throttle_secs=0,
    )
    tf.estimator.train_and_evaluate(
        estimator=self._estimator,
        train_spec=train_spec,
        eval_spec=eval_spec,
    )
I set throttle_secs=0 to make sure no evaluations get skipped. With steps=400, my TensorBoard summaries look like this:
So I think it starts out correctly, logging training summaries every 20 steps (maybe the step-20 summary gets overwritten, since the first one shows up at step 40), but after the first checkpoint essentially nothing gets recorded until training finishes and the final checkpoint and evaluation run.
The code I use to log metrics to TensorBoard:
def scalars(model, features, labels, spec, params):
    std_metrics = {
        "accuracy": tf.metrics.accuracy(
            labels=labels,
            predictions=spec.predictions["class_ids"],
            name="acc_op",
        ),
        "mpc_accuracy": tf.metrics.mean_per_class_accuracy(
            labels=labels,
            predictions=spec.predictions["class_ids"],
            num_classes=params["_n_out_classes"],
            name="mpc_acc_op",
        ),
        "auc": tf.metrics.auc(
            labels=tf.one_hot(indices=labels, depth=params["_n_out_classes"]),
            predictions=spec.predictions["probabilities"],
            name="auc_op",
        ),
    }
    if spec.mode == ModeKeys.EVAL:
        eval_metrics = spec.eval_metric_ops or {}
        eval_metrics.update(std_metrics)
        spec = spec._replace(eval_metric_ops=eval_metrics)
    else:
        tf.summary.scalar("loss", spec.loss)
        tf.summary.scalar("accuracy", std_metrics["accuracy"][1])
        tf.summary.scalar("auc", std_metrics["auc"][1])
    return spec
I'm using several models in this project, so I use the function above together with a decorator to inject that code into the different model_fns.
The decorator looks like this (not sure how much of the code below is relevant, but I'm including it for completeness):
def attach(addons, modes=None, order="POST"):
    def decorator(model_fn):
        @wraps(model_fn)
        def wrapper(self, features, labels, mode, params):
            aux = self.aux_config  # dict so I can enable/disable particular addons by name; in this case the addon is 'scalars'
            targets = modes or ["train", "eval", "predict"]
            target_modes = [
                {
                    "train": ModeKeys.TRAIN,
                    "eval": ModeKeys.EVAL,
                    "predict": ModeKeys.PREDICT,
                }.get(trg_mode.lower())
                for trg_mode in targets
            ]
            applicable_addons = [
                addon
                for addon in addons
                if mode in target_modes and aux.get(addon.__name__, True)
            ]  # addons can be attached for specific modes only; in this case I attach 'scalars' for "TRAIN" and "EVAL"
            if order == "PRE":  # this specifies whether to run the addon code before the model_fn or afterwards
                for add_on in applicable_addons:
                    add_on(self, features, labels, mode, params)
                spec = model_fn(self, features, labels, mode, params)
            elif order == "POST":  # in the case of 'scalars' I first run the model_fn, then pass the spec through the 'scalars' fn above
                spec = model_fn(self, features, labels, mode, params)
                for add_on in applicable_addons:
                    spec = add_on(self, features, labels, spec, params)
            return spec
        return wrapper
    return decorator

# I define a couple of partials based on the order of the functions I want to attach to a model_fn;
# this is just to make the code more legible
prepare = partial(attach, order="PRE")  # if it runs before the model_fn it's 'preparing' something
addon = partial(attach, order="POST")   # if it runs after, it's an 'addon' to the model_fn
Finally, the _model_fn() on my abstract base model class, which returns the concrete model's actual model_fn with all the addons attached, looks like this:
@addon([scalars], ["TRAIN", "EVAL"])
def _model_fn(self, features, labels, mode, params):
    return self.model_fn(features, labels, mode, params)
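For context, _initialize_estimator (called at the top of train_and_eval above) does little more than build the estimator from this _model_fn; a simplified sketch (attribute names here are illustrative, not my exact code):

def _initialize_estimator(self, feature_provider):
    # Simplified sketch: wrap the decorated _model_fn in an Estimator,
    # passing the RunConfig that carries the checkpoint/summary settings.
    self._estimator = tf.estimator.Estimator(
        model_fn=self._model_fn,
        params=self.params,
        config=self.run_config,  # illustrative attribute name
    )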
All of the above works fine when I run the models locally. I'm not even sure this is something I can fix on my end, but is there perhaps some kind of hook, or some other way to make the workers wait for the chief to finish checkpointing before they continue?
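To be concrete about what I mean by "some kind of hook", something along these lines is what I'm imagining (purely an illustrative skeleton; the body of the wait is exactly the part I don't know how to fill in):

import tensorflow as tf

class WaitForChiefHook(tf.train.SessionRunHook):
    """Illustrative skeleton only."""

    def after_create_session(self, session, coord):
        # e.g. block here on the workers until the chief has finished writing
        # the previous checkpoint / running the previous evaluation
        pass

# which I could presumably attach on the workers via the TrainSpec, e.g.
# train_spec = tf.estimator.TrainSpec(input_fn=..., max_steps=steps, hooks=[WaitForChiefHook()])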
Any ideas would be much appreciated, and if I've left out anything relevant to the issue I'm happy to share other parts of the code.
Thanks in advance for your help.