I cannot restore a variational autoencoder in TensorFlow from its checkpoint files. I am following the code in the official repo. For reasons detailed below, I copy the checkpoint files into a new directory eval_dir and declare an estimator:
estimator = tf.estimator.Estimator(
    model_fn,
    params=params,
    config=tf.estimator.RunConfig(
        model_dir=eval_dir,
        eval_distribute=strategy,
    ),
)
eval_results = estimator.evaluate(eval_input_fn)
print("Loss = %f" % eval_results["loss"])
However, the evaluation loss is about 547, not the 190 reached at the end of training. It seems that declaring this estimator does not pick up the state saved in the checkpoint, although it might use it if training were continued.
How do I restore the estimator's state from the checkpoint files?
I am comparing TensorFlow training performance across different systems, in particular one machine with 1 GPU against one with 4 GPUs. I was able to get the distributed MirroredStrategy working.
The mirrored strategy takes longer to run on 4 GPUs than on 1 GPU, but it is also more accurate. I therefore want to save the model at several snapshots and compute the loss on the evaluation dataset afterwards. However, this evaluation adds a lot of overhead, probably from moving data on and off the GPU: when I compute the evaluation loss twice as often, for the same maximum number of steps, the total time doubles.
So I decided to save the model in checkpoints, so that the timer covers only training time, and to come back to the model later to compute the evaluation loss. I was unable to save and restore the state of Estimators directly (I could not convert an Estimator into variables, and without converting it there are no variables to save).
A hacky solution is to take the checkpoint files, e.g. at steps {1, 51, 101}, move them one at a time to another directory, and declare another estimator with that directory as its model directory. But this fails, as described above.
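An alternative I have not tried yet: according to the tf.estimator API, evaluate accepts an explicit checkpoint_path argument, which might make the copying unnecessary. A sketch, reusing saved_dir and one of the snapshot steps from the code below:

```python
import os

saved_dir = "/host/saved"  # directory the checkpoints were moved to
current_steps = 101        # one snapshot step, e.g. from {1, 51, 101}

# The checkpoint path is the common prefix of the model.ckpt-101.index
# and model.ckpt-101.data-* files, with no file extension.
ckpt_path = os.path.join(saved_dir, "model.ckpt-%d" % current_steps)
# eval_results = estimator.evaluate(eval_input_fn, checkpoint_path=ckpt_path)
```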
I am including the complete code of the main function; the rest is analogous to the original variational autoencoder:
import glob
import json
import os
import shutil
import time

import numpy as np
import tensorflow as tf


def main(argv):
    del argv  # unused
    params = FLAGS.flag_values_dict()
    # Replace the activation name with the function it stands for,
    # e.g. "leaky_relu" with tf.nn.leaky_relu
    params["activation"] = getattr(tf.nn, params["activation"])
    if FLAGS.delete_existing and tf.gfile.Exists(FLAGS.model_dir):
        tf.logging.warn("Deleting old log directory at {}".format(FLAGS.model_dir))
        tf.gfile.DeleteRecursively(FLAGS.model_dir)
    tf.gfile.MakeDirs(FLAGS.model_dir)
    if FLAGS.fake_data:
        train_input_fn, eval_input_fn = build_fake_input_fns(FLAGS.batch_size)
    else:
        train_input_fn, eval_input_fn = build_input_fns(FLAGS.data_dir,
                                                        FLAGS.batch_size)
    num_gpus = get_number_gpus()
    print("Found GPUs: " + str(num_gpus))
    if 1 == num_gpus:
        strategy = None
    else:
        if tf.__version__.startswith("1."):
            strategy = tf.contrib.distribute.MirroredStrategy()
        else:
            strategy = tf.distribute.MirroredStrategy()
    estimator = tf.estimator.Estimator(
        model_fn,
        params=params,
        config=tf.estimator.RunConfig(
            model_dir=FLAGS.model_dir,
            save_checkpoints_steps=FLAGS.viz_steps,
            train_distribute=strategy,
            keep_checkpoint_max=None,
        ),
    )
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # All seed setting from
        # https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
        tf.set_random_seed(0)
        np.random.seed(0)
        first_random_num = sess.run(tf.random.normal([100, 100], seed=0)[0, 0])
        print("First random number = %1.12f" % first_random_num)
    # Prepare the object that will hold the results
    chart = dict()
    # Start by running 1 step of the estimator to warm up any use of the GPU,
    # then run every viz_steps
    steps = [1 + k * FLAGS.viz_steps
             for k in range(FLAGS.max_steps // FLAGS.viz_steps + 1)]
    start = time.time()
    lap = start
    print("Running at these steps: ")
    print(steps)
    # Run training and keep track of time
    previous_steps = 0
    for current_steps in steps:
        estimator.train(train_input_fn, steps=current_steps - previous_steps)
        new_lap = time.time()
        chart[current_steps] = {
            "iteration": current_steps,
            "total duration": new_lap - start,
            "lap duration": new_lap - lap,
        }
        lap = new_lap
        previous_steps = current_steps
    # Run an evaluation of the model to make sure that
    # checkpoints are working properly
    eval_results = estimator.evaluate(eval_input_fn)
    chart["final"] = float(eval_results["loss"])
    print("Loss at finish = %f" % eval_results["loss"])
    # This is the only way I found to restore intermediately trained models:
    # move the folder with all the checkpoints elsewhere, then copy back
    # the files with a specific iteration
    saved_dir = "/host/saved"
    if os.path.exists(saved_dir):
        shutil.rmtree(saved_dir)
    shutil.move(FLAGS.model_dir, saved_dir)
    # Run evaluation over each checkpoint of the model
    for current_steps in steps:
        item = chart[current_steps]
        print("Ready to evaluate %d" % current_steps)
        eval_dir = os.path.join(FLAGS.model_dir, "eval%d" % current_steps)
        os.makedirs(eval_dir)
        # Copy back model files
        for f in glob.glob(os.path.join(saved_dir,
                                        "model.ckpt-%d.*" % current_steps)):
            shutil.copy(f, eval_dir)
        estimator = tf.estimator.Estimator(
            model_fn,
            params=params,
            config=tf.estimator.RunConfig(
                model_dir=eval_dir,
                eval_distribute=strategy,
            ),
        )
        start = time.time()
        eval_results = estimator.evaluate(eval_input_fn)
        item["evaluation loss"] = float(eval_results["loss"])
        item["evaluation duration (lap)"] = time.time() - start
        print("Loss = %f" % eval_results["loss"])
    # Save results to JSON file
    print(chart)
    with open(os.path.join(FLAGS.model_dir, "benchmark.json"), "w+") as f:
        f.write(json.dumps(chart))
Sampled results from benchmark.json, after 101 steps:
{"101": {"evaluation loss": 547.013916015625, ...}, ... "final": 190.50387573242188}