Training loss explodes

Asked: 2021-06-12 14:40:18

Tags: python tensorflow keras deep-learning sgd

I am trying to reproduce this paper. I am using SGDW with the following learning-rate schedule:

> lr = base_lr * (1 + gamma * iter)^(-power)
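For intuition, here is a quick standalone check of what this formula produces (my own snippet, not from the original post), using the same constants the optimizer below is configured with:

# Plain-Python evaluation of lr = base_lr * (1 + gamma * iter)^(-power)
base_lr, gamma, power = 1e-5, 1e-3, 0.75

for it in [0, 100, 1_000, 10_000]:
    lr = base_lr * (1 + gamma * it) ** -power
    print(f"iter={it:>6}  lr={lr:.3e}")
# iter=     0  lr=1.000e-05
# iter=   100  lr=9.310e-06
# iter=  1000  lr=5.946e-06
# iter= 10000  lr=1.656e-06

Since gamma and power are both positive, the schedule is strictly decreasing, so the learning rate itself cannot be what blows up.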

My custom schedule:

import tensorflow as tf
# The schedule below uses TensorFlow's private ops modules, mirroring how the
# built-in schedules are implemented:
from tensorflow.python.framework import ops
from tensorflow.python.ops import math_ops


class Example(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Inverse-time decay: lr = base_lr * (1 + gamma * iter)^(-power)."""

    def __init__(
        self,
        initial_learning_rate=1e-5,
        gamma=0.001,
        power=0.75,
        decay_steps=1,
        verbose=1,
        name="Example",
    ):
        super(Example, self).__init__()
        self.initial_learning_rate = initial_learning_rate
        self.gamma = gamma
        self.power = power
        self.verbose = verbose
        self.decay_steps = decay_steps
        self.name = name

    def __call__(self, step):
        with ops.name_scope_v2(self.name or "Example") as name:
            initial_learning_rate = ops.convert_to_tensor_v2_with_dispatch(
                self.initial_learning_rate, name="initial_learning_rate"
            )
            _dtype = initial_learning_rate.dtype
            gamma = math_ops.cast(self.gamma, _dtype)
            power = math_ops.cast(self.power, _dtype)
            decay_steps = math_ops.cast(self.decay_steps, _dtype)

            # "iter" in the paper's formula: the optimizer step, rescaled by
            # decay_steps (1 by default, so iteration == step).
            global_step_recomp = math_ops.cast(step, _dtype)
            iteration = math_ops.divide(x=global_step_recomp, y=decay_steps)

            # lr = base_lr * (1 + gamma * iter)^(-power)
            lr = math_ops.multiply(
                initial_learning_rate,
                math_ops.pow(1.0 + gamma * iteration, -power),
                name=name,
            )
            return lr
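To confirm the class implements the formula (my own sanity check, not from the original post), it can be called directly with a step value:

schedule = Example(initial_learning_rate=1e-5, gamma=1e-3, power=0.75)
for step in [0, 1_000, 10_000]:
    # schedule(step) returns a scalar tensor; float() extracts its value
    print(f"step={step:>6}  lr={float(schedule(step)):.3e}")
# Expected to match the plain-Python values above:
# 1.000e-05, 5.946e-06, 1.656e-06.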

My optimizer:

import tensorflow_addons as tfa

optimizer = tfa.optimizers.SGDW(
    learning_rate=Example(
        initial_learning_rate=1e-5,
        gamma=1e-3,
        power=0.75,
        verbose=smoke_test,  # smoke_test is set elsewhere in my script
    ),
    weight_decay=5e-4,
    momentum=0.9,
)
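(A side note on SGDW, added here and not part of the original post: the TensorFlow Addons documentation for its decoupled-weight-decay optimizers warns that when the learning rate is decayed, the weight decay should be manually decayed as well, since SGDW does not scale weight_decay by the learning-rate schedule. TFA also accepts a schedule for weight_decay, so the same Example class could be reused; a sketch, not a confirmed diagnosis:)

lr_schedule = Example(initial_learning_rate=1e-5, gamma=1e-3, power=0.75)
# Hypothetical variant: give the weight decay the same decay shape as the
# learning rate, per the TFA recommendation.
wd_schedule = Example(initial_learning_rate=5e-4, gamma=1e-3, power=0.75)

optimizer = tfa.optimizers.SGDW(
    learning_rate=lr_schedule,
    weight_decay=wd_schedule,
    momentum=0.9,
)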

However, the loss and the mean absolute error explode after 30 epochs. Here are the training plots:

[Plots: Learning Rate per epoch; Train and Validation Loss per epoch; Train and Validation MAE per epoch]

Something must be wrong here. I have tried different optimizers and schedulers, and everything works fine with those.

0 Answers