I cannot restore a variational autoencoder in TensorFlow from its checkpoint files. I am following the code in the official repo. For reasons detailed below, I copy the checkpoint files into a new directory eval_dir and declare an estimator:
estimator = tf.estimator.Estimator(
    model_fn,
    params=params,
    config=tf.estimator.RunConfig(
        model_dir=eval_dir,
        eval_distribute=strategy,
    ),
)
eval_results = estimator.evaluate(eval_input_fn)
print("Loss = %f" % eval_results["loss"])
However, the evaluation loss is about 547, not the 190 reached at the end of training. It seems that declaring this estimator does not pick up the state saved in the checkpoint, although it might use it if training were continued.
How do I restore the estimator's state from the checkpoint files?
I am comparing TensorFlow training performance across different systems, in particular one machine with 1 GPU against one with 4 GPUs. I was able to get the distributed MirroredStrategy working.
The mirrored strategy takes longer to run on 4 GPUs than on 1 GPU, but it is also more accurate. I therefore want to save the model at several snapshots and compute the loss on the evaluation dataset afterwards. However, this evaluation adds a lot of overhead, probably from moving data on and off the GPU: when I compute the evaluation loss twice as often, for the same maximum number of steps, the total time doubles.
So I decided to save the model in checkpoints, so that the timer covers only training time, and to come back to the model later to compute the evaluation loss. I was unable to save and restore the state of Estimators directly (I could not convert an Estimator into variables, and without converting it there are no variables to save).
A hacky solution is to take the checkpoint files, e.g. at steps {1, 51, 101}, move them one at a time to another directory, and declare another estimator with that directory as its model directory. But this fails, as described above.
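An alternative I have not tried yet: according to the tf.estimator API, evaluate accepts an explicit checkpoint_path argument, which might make the copying unnecessary. A sketch, reusing saved_dir and one of the snapshot steps from the code below:

```python
import os

saved_dir = "/host/saved"  # directory the checkpoints were moved to
current_steps = 101        # one snapshot step, e.g. from {1, 51, 101}

# The checkpoint path is the common prefix of the model.ckpt-101.index
# and model.ckpt-101.data-* files, with no file extension.
ckpt_path = os.path.join(saved_dir, "model.ckpt-%d" % current_steps)
# eval_results = estimator.evaluate(eval_input_fn, checkpoint_path=ckpt_path)
```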
I am including the complete code of the main function; the rest is analogous to the original variational autoencoder:
import glob
import json
import os
import shutil
import time

import numpy as np
import tensorflow as tf


def main(argv):
    del argv  # unused
    params = FLAGS.flag_values_dict()
    # Replace the activation name with the function it stands for,
    # e.g. "leaky_relu" with tf.nn.leaky_relu
    params["activation"] = getattr(tf.nn, params["activation"])
    if FLAGS.delete_existing and tf.gfile.Exists(FLAGS.model_dir):
        tf.logging.warn("Deleting old log directory at {}".format(FLAGS.model_dir))
        tf.gfile.DeleteRecursively(FLAGS.model_dir)
    tf.gfile.MakeDirs(FLAGS.model_dir)
    if FLAGS.fake_data:
        train_input_fn, eval_input_fn = build_fake_input_fns(FLAGS.batch_size)
    else:
        train_input_fn, eval_input_fn = build_input_fns(FLAGS.data_dir,
                                                        FLAGS.batch_size)
    num_gpus = get_number_gpus()
    print("Found GPUs: " + str(num_gpus))
    if 1 == num_gpus:
        strategy = None
    else:
        if tf.__version__.startswith("1."):
            strategy = tf.contrib.distribute.MirroredStrategy()
        else:
            strategy = tf.distribute.MirroredStrategy()
    estimator = tf.estimator.Estimator(
        model_fn,
        params=params,
        config=tf.estimator.RunConfig(
            model_dir=FLAGS.model_dir,
            save_checkpoints_steps=FLAGS.viz_steps,
            train_distribute=strategy,
            keep_checkpoint_max=None,
        ),
    )
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # All seed setting from
        # https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
        tf.set_random_seed(0)
        np.random.seed(0)
        first_random_num = sess.run(tf.random.normal([100, 100], seed=0)[0, 0])
        print("First random number = %1.12f" % first_random_num)
    # Prepare the object that will hold the results
    chart = dict()
    # Start by running 1 step of the estimator to warm up any use of the GPU,
    # then run every viz_steps
    steps = [1 + k * FLAGS.viz_steps
             for k in range(FLAGS.max_steps // FLAGS.viz_steps + 1)]
    start = time.time()
    lap = start
    print("Running at these steps: ")
    print(steps)
    # Run training and keep track of time
    previous_steps = 0
    for current_steps in steps:
        estimator.train(train_input_fn, steps=current_steps - previous_steps)
        new_lap = time.time()
        chart[current_steps] = {
            "iteration": current_steps,
            "total duration": new_lap - start,
            "lap duration": new_lap - lap,
        }
        lap = new_lap
        previous_steps = current_steps
    # Run an evaluation of the model to make sure that
    # checkpoints are working properly
    eval_results = estimator.evaluate(eval_input_fn)
    chart["final"] = float(eval_results["loss"])
    print("Loss at finish = %f" % eval_results["loss"])
    # This is the only way I found to restore intermediately trained models:
    # move the folder with all the checkpoints elsewhere, then copy back
    # the files with a specific iteration
    saved_dir = "/host/saved"
    if os.path.exists(saved_dir):
        shutil.rmtree(saved_dir)
    shutil.move(FLAGS.model_dir, saved_dir)
    # Run evaluation over each checkpoint of the model
    for current_steps in steps:
        item = chart[current_steps]
        print("Ready to evaluate %d" % current_steps)
        eval_dir = os.path.join(FLAGS.model_dir, "eval%d" % current_steps)
        os.makedirs(eval_dir)
        # Copy back model files
        for f in glob.glob(os.path.join(saved_dir,
                                        "model.ckpt-%d.*" % current_steps)):
            shutil.copy(f, eval_dir)
        estimator = tf.estimator.Estimator(
            model_fn,
            params=params,
            config=tf.estimator.RunConfig(
                model_dir=eval_dir,
                eval_distribute=strategy,
            ),
        )
        start = time.time()
        eval_results = estimator.evaluate(eval_input_fn)
        item["evaluation loss"] = float(eval_results["loss"])
        item["evaluation duration (lap)"] = time.time() - start
        print("Loss = %f" % eval_results["loss"])
    # Save results to JSON file
    print(chart)
    with open(os.path.join(FLAGS.model_dir, "benchmark.json"), "w+") as f:
        f.write(json.dumps(chart))
Sampled results from benchmark.json, after 101 steps:
{"101": {"evaluation loss": 547.013916015625, ...}, ... "final": 190.50387573242188}