How do I restore TF's learning rate from a previously saved checkpoint?

Date: 2017-07-03 13:20:25

Tags: tensorflow

I stopped training at some point and saved the checkpoint, meta files, etc. Now, when I resume training, I want to start from the learning rate the optimizer had at the end of its last run. Can you provide an example of how to do this?

2 answers:

Answer 0: (score: 1)

For those who come here (like me) wondering whether the last learning rate is restored automatically: tf.train.exponential_decay does not add any Variable to the graph. It only adds the operations needed to derive the correct current learning rate from a given global_step value. So you only need to checkpoint global_step (which is usually done by default), and as long as you keep the same initial learning rate, decay steps, and decay factor, training will automatically pick up where you left off, with the correct learning rate.

Inspecting the checkpoint will not show any learning_rate variable (or similar), simply because none is needed.
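A minimal sketch to illustrate this (TF 1.x API; the checkpoint path './lr_demo.ckpt' is made up for the example): only global_step is saved, yet the decayed learning rate comes back correctly after the restore.

import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name='global_step')
lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # same hyperparameters before and after restore
saver = tf.train.Saver()  # global_step is the only variable here, so it is checkpointed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.assign(global_step, 50))   # pretend we already trained for 50 steps
    print(sess.run(lr))                    # decayed rate at step 50
    saver.save(sess, './lr_demo.ckpt')     # hypothetical path, for illustration only

with tf.Session() as sess:
    saver.restore(sess, './lr_demo.ckpt')  # restores only global_step
    print(sess.run(lr))                    # same decayed rate, recomputed from global_step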

Answer 1: (score: 0)

This example code learns to add two numbers:

import tensorflow as tf
import numpy as np
import os


save_ckpt_dir = './add_ckpt'
ckpt_filename = 'add.ckpt'

save_ckpt_path = os.path.join(save_ckpt_dir, ckpt_filename)

if not os.path.isdir(save_ckpt_dir):
    os.mkdir(save_ckpt_dir)

if [fname.startswith("add.ckpt") for fname in os.listdir(save_ckpt_dir)]:  # prefer to load pre-trained net
    load_ckpt_path = save_ckpt_path
else:
    load_ckpt_path = None  # train from scratch


def add_layer(inputs, in_size, out_size, activation_fn=None):

    Weights = tf.Variable(tf.ones([in_size, out_size]), name='Weights')
    biases = tf.Variable(tf.zeros([1, out_size]), name='biases')
    Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
    if activation_fn is None:
        layer_output = Wx_plus_b
    else:
        layer_output = activation_fn(Wx_plus_b)
    return layer_output


def produce_batch(batch_size=256):
    """Loads a single batch of data.

    Args:
      batch_size: The number of exercises in the batch.

    Returns:
      x : column vector of numbers
      y : another column of numbers
      xy_sum : the sum of the columns
    """
    x = np.random.random(size=[batch_size, 1]) * 10
    y = np.random.random(size=[batch_size, 1]) * 10
    xy_sum = x + y
    return x, y, xy_sum


with tf.name_scope("inputs"):
    xs = tf.placeholder(tf.float32, [None, 1])
    ys = tf.placeholder(tf.float32, [None, 1])

with tf.name_scope("correct_labels"):
    xysums = tf.placeholder(tf.float32, [None, 1])

with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # start lr=0.15, decay every 10 steps with a base of 0.96

with tf.name_scope("graph_body"):
    prediction = add_layer(tf.concat([xs, ys], 1), 2, 1, activation_fn=None)

with tf.name_scope("loss_and_train"):
    # the error between prediction and real data
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(xysums-prediction), reduction_indices=[1]))

    # Passing global_step to minimize() will increment it at each step.
    train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)


with tf.name_scope("init_load_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    if load_ckpt_path:
        saver.restore(sess, load_ckpt_path)
    for i in range(1000):
        x, y, xy_sum = produce_batch(256)
        _, global_step_np, loss_np, lr_np = sess.run([train_step, global_step, loss, lr], feed_dict={xs: x, ys: y, xysums: xy_sum})
        if global_step_np % 100 == 0:
            print("global step: {}, loss: {}, learning rate: {}".format(global_step_np, loss_np, lr_np))

    saver.save(sess, save_ckpt_path)

If you run it a few times, you will see the learning rate decrease. It also saves the global step. The trick is here:

with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # start lr=0.15, decay every 10 steps with a base of 0.96
...
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)

By default, saver.save saves all saveable objects (including the global step, which drives the decayed learning rate). However, if tf.train.Saver is given a var_list, saver.save will only save the variables contained in var_list:

saver = tf.train.Saver(var_list = ..list of vars to save..) 
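For example, this sketch (the variable names are hypothetical) builds a Saver whose var_list deliberately leaves out global_step; restoring from such a checkpoint would leave the step counter at its initial value, so tf.train.exponential_decay would start again from the initial learning rate:

# hypothetical sketch: checkpoint everything except global_step
vars_without_step = [v for v in tf.global_variables() if v is not global_step]
partial_saver = tf.train.Saver(var_list=vars_without_step)
partial_saver.save(sess, save_ckpt_path)
# a restore from this checkpoint leaves global_step at 0,
# so the decayed learning rate effectively resets to the initial 0.15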

Sources: https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay

https://stats.stackexchange.com/questions/200063/tensorflow-adam-optimizer-with-exponential-decay

https://www.tensorflow.org/api_docs/python/tf/train/Saver (see "Saveable objects")