Restoring a model from a checkpoint

Date: 2019-02-20 19:27:30

Tags: tensorflow

I cannot restore a variational autoencoder in TensorFlow from its checkpoint files. I am following the code in the official repo. For the reasons detailed below, I copy the checkpoint files into a new directory eval_dir and declare an estimator:

      estimator = tf.estimator.Estimator(
          model_fn,
          params=params,
          config=tf.estimator.RunConfig(
              model_dir=eval_dir,
              eval_distribute=strategy,
          ),
      )

      eval_results = estimator.evaluate(eval_input_fn)
      print("Loss = %f" % eval_results["loss"])

However, the evaluation loss is about 547 instead of the 190 reached at the end of training. It seems that declaring this estimator does not pick up the state saved in the checkpoint, although it would presumably use it if training were continued.
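
As far as I understand it, an Estimator discovers checkpoints through the checkpoint state file that tf.train.Saver maintains in model_dir, and evaluate() restores from the latest one it finds there. Below is a minimal diagnostic sketch, reusing eval_dir, estimator and eval_input_fn from above (checkpoint_path is a standard argument of evaluate(); the step number 101 is only an example):

    # Check what the estimator would actually restore from eval_dir.
    # tf.train.latest_checkpoint reads the "checkpoint" state file in the
    # directory; if it returns None, recent TF 1.x versions log a warning
    # and evaluate with freshly initialized variables instead.
    latest = tf.train.latest_checkpoint(eval_dir)
    print("Latest checkpoint found: %s" % latest)

    # Alternatively, point evaluate() at a checkpoint prefix explicitly,
    # bypassing discovery (checkpoint prefix assumed for illustration):
    eval_results = estimator.evaluate(
        eval_input_fn,
        checkpoint_path=os.path.join(eval_dir, "model.ckpt-101"),
    )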

How can I restore the estimator's state from the checkpoint files?

More information

I am comparing TensorFlow's training performance across different systems, in particular one with 1 GPU versus one with 4 GPUs. I was able to get the distributed MirroredStrategy working.

With the mirrored strategy, a run on 4 GPUs takes longer than on 1 GPU, but it is also more accurate. I therefore want to save the model at several snapshots and compute the loss on the evaluation dataset for each. However, this evaluation adds considerable overhead, probably due to moving data onto and off the GPUs (when I compute the evaluation loss twice as often, the total time for the same maximum number of steps doubles).

So I decided to save the model to checkpoints, so that the timer measures training time only, and to come back to the model later to compute the evaluation losses. I was unable to save and restore the state of Estimators directly (I could not turn the Estimator into variables, and without such a conversion there are no variables to save).
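
A side note on the parenthetical above: tf.estimator.Estimator does expose get_variable_names() and get_variable_value(), but as far as I know both read from the latest checkpoint in model_dir rather than from a live session, so they presuppose exactly the checkpoint machinery in question. A minimal sketch:

    # Sketch: inspecting an Estimator's variables. Both calls load values
    # from the latest checkpoint in model_dir; they fail if no checkpoint
    # has been written yet.
    for name in estimator.get_variable_names():
        value = estimator.get_variable_value(name)  # returns a numpy value
        print(name, value.shape)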

A hacky solution is to take the checkpoint files, e.g. those of steps {1, 51, 101}, move them all to another directory at once, and then declare another estimator with that directory as its model directory. But this fails, as described above.

Here is the full code of the main function; the rest is analogous to the original variational autoencoder:

import glob
import json
import os
import shutil
import time

import numpy as np
import tensorflow as tf

# FLAGS, model_fn, get_number_gpus, build_input_fns and build_fake_input_fns
# are defined elsewhere in the script, analogous to the original repo

def main(argv):
  del argv  # unused

  params = FLAGS.flag_values_dict()
  params["activation"] = getattr(tf.nn, params["activation"]) # this replaces the string with the function it stands for, e.g. "leaky_relu" with tf.nn.leaky_relu
  if FLAGS.delete_existing and tf.gfile.Exists(FLAGS.model_dir):
    tf.logging.warn("Deleting old log directory at {}".format(FLAGS.model_dir))
    tf.gfile.DeleteRecursively(FLAGS.model_dir)
  tf.gfile.MakeDirs(FLAGS.model_dir)

  if FLAGS.fake_data:
    train_input_fn, eval_input_fn = build_fake_input_fns(FLAGS.batch_size)
  else:
    train_input_fn, eval_input_fn = build_input_fns(FLAGS.data_dir,
                                                    FLAGS.batch_size)

  num_gpus = get_number_gpus()
  print("Found GPUs: " + str(num_gpus))

  # Use a mirrored distribution strategy only when more than one GPU is available
  if num_gpus == 1:
      strategy = None
  elif tf.__version__.startswith("1."):
      strategy = tf.contrib.distribute.MirroredStrategy()
  else:
      strategy = tf.distribute.MirroredStrategy()

  estimator = tf.estimator.Estimator(
      model_fn,
      params=params,
      config=tf.estimator.RunConfig(
          model_dir=FLAGS.model_dir,
          save_checkpoints_steps=FLAGS.viz_steps,
          train_distribute=strategy,
          keep_checkpoint_max=None
      ),
  )

  with tf.Session() as sess:

      sess.run(tf.global_variables_initializer())

      # All seed setting from
      # https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
      tf.set_random_seed(0)
      np.random.seed(0)

      first_random_num = sess.run(tf.random.normal([100, 100], seed=0)[0, 0])
      print("First random number = %1.12f" % first_random_num)

      # Prepare the object that will hold the results
      chart = dict()

      # Start by running 1 step of the estimator to warm up any use of the GPU
      # then run every viz_steps
      steps = [1 + k * FLAGS.viz_steps for k in range(FLAGS.max_steps // FLAGS.viz_steps + 1)]
      start = time.time()
      lap = start
      print("Running at these steps: ")
      print(steps)

      # Run training and keep track of time
      previous_steps = 0
      for current_steps in steps:
          estimator.train(train_input_fn, steps=current_steps - previous_steps)

          new_lap = time.time()
          chart[current_steps] = {
              "iteration": current_steps,
              "total duration": new_lap - start,
              "lap duration": new_lap - lap,
          }
          lap = new_lap
          previous_steps = current_steps

      # Run an evaluation of the model to make sure that
      # checkpoints are working properly
      eval_results = estimator.evaluate(eval_input_fn)
      chart["final"] = float(eval_results["loss"])
      print("Loss at finish = %f" % eval_results["loss"])

      # This is the only way I found to restore
      # intermediately trained models: move the folder
      # with all the checkpoints elsewhere, then copy back the files with
      # a specific iteration
      saved_dir = "/host/saved"

      if os.path.exists(saved_dir):
          shutil.rmtree(saved_dir)
      shutil.move(FLAGS.model_dir, saved_dir)

      # Run evaluation over each checkpoint of the model
      for current_steps in steps:
          item = chart[current_steps]

          print("Ready to evaluate %d" % current_steps)

          eval_dir = os.path.join(FLAGS.model_dir, "eval%d" % current_steps)
          os.makedirs(eval_dir)

          # Copy back model files
          for f in glob.glob(os.path.join(saved_dir, "model.ckpt-%d.*" % current_steps)):
              shutil.copy(f, eval_dir)

          estimator = tf.estimator.Estimator(
              model_fn,
              params=params,
              config=tf.estimator.RunConfig(
                  model_dir=eval_dir,
                  eval_distribute=strategy,
              ),
          )

          start = time.time()
          eval_results = estimator.evaluate(eval_input_fn)
          item["evaluation loss"] = float(eval_results["loss"])
          item["evaluation duration (lap)"] = time.time() - start
          print("Loss = %f" % eval_results["loss"])

      # Save results to JSON file
      print(chart)
      with open(os.path.join(FLAGS.model_dir, "benchmark.json"), "w+") as f:
          f.write(json.dumps(chart))

A sample of the results in benchmark.json, after 101 steps:

{"101": {"evaluation loss": 547.013916015625, ...}, ... "final": 190.50387573242188}

0 Answers