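What is the proper way to apply weight decay with the Adam optimizer?

Time: 2017-06-09 08:08:42

Tags: tensorflow deep-learning caffe torch mxnet

Since the Adam optimizer keeps a pair of running averages (mean/variance) of the gradients, I wonder how it should properly handle weight decay. I have seen two ways of implementing it.

  1. Update the mean/variance only from the gradients of the objective loss, and decay the weights explicitly at each mini-batch. (The following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py)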

    weight[:] -= lr*mean/(sqrt(variance) + self.epsilon)
    
    wd = self._get_wd(index)
    if wd > 0.:
        weight[:] -= (lr * wd) * weight
    
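  2. Update the mean/variance from the gradients of the objective loss + regularization loss, and update the weights as usual. (The following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210)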

    grad = scalar<DType>(param.rescale_grad) * grad +
    scalar<DType>(param.wd) * weight;
    // stuff
    Assign(out, req[0],
       weight -
       scalar<DType>(param.lr) * mean /
       (F<square_root>(var) + scalar<DType>(param.epsilon)));
    
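These two approaches sometimes show a significant difference in training results. I actually think the first one makes more sense (and I find that it gives better results from time to time). Caffe and old versions of mxnet follow the first approach, while torch, tensorflow and new versions of mxnet follow the second one.

I would really appreciate your help!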
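To make the difference between the two approaches concrete, here is a minimal pure-Python sketch on a single scalar weight. It is not the mxnet code: the function names `adam_decoupled` and `adam_l2`, the hyperparameter values and the gradient sequence are all made up for illustration, and Adam's bias correction is left out for brevity.

```python
import math


def adam_decoupled(w, grads, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Approach 1: the moments only see the objective gradient;
    the weight is decayed explicitly after the Adam step."""
    mean = var = 0.0
    for g in grads:
        mean = b1 * mean + (1 - b1) * g
        var = b2 * var + (1 - b2) * g * g
        w -= lr * mean / (math.sqrt(var) + eps)
        w -= lr * wd * w  # explicit decay, not seen by mean/var
    return w


def adam_l2(w, grads, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Approach 2: wd * w is added to the gradient before the
    moments are updated; the weight update itself is plain Adam."""
    mean = var = 0.0
    for g in grads:
        g = g + wd * w  # regularization gradient folded in
        mean = b1 * mean + (1 - b1) * g
        var = b2 * var + (1 - b2) * g * g
        w -= lr * mean / (math.sqrt(var) + eps)
    return w


objective_grads = [0.5, -0.2, 0.1]  # hypothetical objective gradients
print(adam_decoupled(1.0, objective_grads))
print(adam_l2(1.0, objective_grads))
```

With a zero objective gradient, approach 1 shrinks the weight by the factor (1 - lr * wd). In approach 2 the decay term is itself divided by sqrt(variance), so in this sketch the first step has size lr * (1 - b1) / sqrt(1 - b2) no matter how small wd is (as long as wd * w is large compared to eps).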

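2 answers:

Answer 0: (score: 6)

Edit: see also this PR, which just got merged into TF.

When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.

Weight decay (I don't know how to use TeX here, so excuse my pseudo-notation):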

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

L2-regularization:

loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2-regularization gives lambda * w, so inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw

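gives the same as weight decay, but mixes lambda with the learning_rate (weight_decay = learning_rate * lambda). Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)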
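A quick numerical check of this claim, as a sketch in plain Python. The toy loss 0.5 * (w - 3)^2, the function names and all hyperparameter values are assumptions chosen for illustration; `lam` stands for lambda.

```python
def toy_grad(w):
    """Gradient of the hypothetical toy loss 0.5 * (w - 3) ** 2."""
    return w - 3.0


def sgd_weight_decay(w, lr, weight_decay, steps, momentum=0.0):
    """SGD (optionally with momentum) plus an explicit decay term."""
    v = 0.0
    for _ in range(steps):
        v = momentum * v + toy_grad(w)
        w = w - lr * v - weight_decay * w
    return w


def sgd_l2(w, lr, lam, steps, momentum=0.0):
    """SGD (optionally with momentum) on loss + lam/2 * w ** 2."""
    v = 0.0
    for _ in range(steps):
        v = momentum * v + toy_grad(w) + lam * w
        w = w - lr * v
    return w


lr, lam = 0.1, 0.1
# Plain SGD: identical when weight_decay = lr * lam.
print(sgd_weight_decay(1.0, lr, lr * lam, steps=2),
      sgd_l2(1.0, lr, lam, steps=2))
# With momentum the L2 gradient is accumulated in v, so the results drift apart.
print(sgd_weight_decay(1.0, lr, lr * lam, steps=2, momentum=0.9),
      sgd_l2(1.0, lr, lam, steps=2, momentum=0.9))
```

The two variants agree for plain SGD and already differ after the second step once momentum is switched on, because the lam * w term from the previous step is carried along in the momentum buffer.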

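That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is to write an op that performs the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude work-arounds, though. My current code: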

# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

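This makes some use of the bookkeeping that TensorFlow provides. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key; I then sum them all up and optimize the sum using SGD, which, as shown above, corresponds to actual weight decay.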
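Why an SGD step with learning_rate=1.0 on the summed regularization losses amounts to a decay step can be checked in plain Python. This sketch assumes the regularizer for each weight is scale * 0.5 * w^2 (which is, as far as I know, what an L2 regularizer built on tf.nn.l2_loss computes); all helper names are hypothetical.

```python
def l2_regularization_grads(weights, scale):
    """Gradient of sum(scale * 0.5 * w ** 2) with respect to each weight."""
    return [scale * w for w in weights]


def sgd_step(weights, grads, learning_rate):
    """One plain gradient-descent step."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]


def decay_via_sgd(weights, scale):
    """The work-around: SGD with learning_rate=1.0 on the regularization loss."""
    grads = l2_regularization_grads(weights, scale)
    return sgd_step(weights, grads, learning_rate=1.0)


def decay_directly(weights, scale):
    """Textbook weight decay: w <- w - scale * w."""
    return [w - scale * w for w in weights]


weights = [0.5, -2.0, 3.0]
print(decay_via_sgd(weights, scale=0.01))
print(decay_directly(weights, scale=0.01))
```

Note that under this assumption the decay per step is the regularizer scale itself; because the extra SGD optimizer uses a fixed learning rate of 1.0, it does not follow whatever learning rate the main optimizer uses.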

Hope that helps. If anyone has a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

Answer 1: (score: 1)

I came across the same question. I think this code, which I got from here, will work for you. It implements a weight-decay Adam optimizer by inheriting from tf.train.Optimizer. This is the cleanest solution I have found:

import re

import tensorflow as tf  # TensorFlow 1.x API (tf.train.Optimizer, tf.get_variable)


class AdamWeightDecayOptimizer(tf.train.Optimizer):
    """A basic Adam optimizer that includes "correct" L2 weight decay."""

    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.0,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="AdamWeightDecayOptimizer"):
        """Constructs a AdamWeightDecayOptimizer."""
        super(AdamWeightDecayOptimizer, self).__init__(False, name)

        self.learning_rate = learning_rate
        self.weight_decay_rate = weight_decay_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon
        self.exclude_from_weight_decay = exclude_from_weight_decay

    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        """See base class."""
        assignments = []
        for (grad, param) in grads_and_vars:
            if grad is None or param is None:
                continue

            param_name = self._get_variable_name(param.name)

            m = tf.get_variable(
                name=param_name + "/adam_m",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())
            v = tf.get_variable(
                name=param_name + "/adam_v",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())

            # Standard Adam update.
            next_m = (
                tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
            next_v = (
                tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                          tf.square(grad)))

            update = next_m / (tf.sqrt(next_v) + self.epsilon)

            # Just adding the square of the weights to the loss function is *not*
            # the correct way of using L2 regularization/weight decay with Adam,
            # since that will interact with the m and v parameters in strange ways.
            #
            # Instead we want to decay the weights in a manner that doesn't interact
            # with the m/v parameters. This is equivalent to adding the square
            # of the weights to the loss with plain (non-momentum) SGD.
            if self._do_use_weight_decay(param_name):
                update += self.weight_decay_rate * param

            update_with_lr = self.learning_rate * update

            next_param = param - update_with_lr

            assignments.extend(
                [param.assign(next_param),
                 m.assign(next_m),
                 v.assign(next_v)])
        return tf.group(*assignments, name=name)

    def _do_use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if not self.weight_decay_rate:
            return False
        if self.exclude_from_weight_decay:
            for r in self.exclude_from_weight_decay:
                if re.search(r, param_name) is not None:
                    return False
        return True

    def _get_variable_name(self, param_name):
        """Get the variable name from the tensor name."""
        m = re.match("^(.*):\\d+$", param_name)
        if m is not None:
            param_name = m.group(1)
        return param_name

You can use it in the following way (I have made some changes to make it useful in a more general context); the function will return a train_op that can be used in the Session:
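Leaving the session wiring aside, the per-parameter arithmetic of the class above can be sketched in plain Python on a single scalar. This is only an illustration of the update rule, not a replacement for the TensorFlow code: the function name `adam_weight_decay_step` and the example values are hypothetical, and like the class it applies no bias correction.

```python
import math


def adam_weight_decay_step(param, grad, m, v,
                           learning_rate=0.001,
                           weight_decay_rate=0.01,
                           beta_1=0.9,
                           beta_2=0.999,
                           epsilon=1e-6):
    """One scalar update mirroring the arithmetic in apply_gradients."""
    # Moment estimates are built from the raw gradient only.
    next_m = beta_1 * m + (1.0 - beta_1) * grad
    next_v = beta_2 * v + (1.0 - beta_2) * grad * grad
    update = next_m / (math.sqrt(next_v) + epsilon)
    # The decay term is added after the normalization by sqrt(v),
    # so it does not interact with m and v.
    update += weight_decay_rate * param
    next_param = param - learning_rate * update
    return next_param, next_m, next_v


param, m, v = 2.0, 0.0, 0.0
for grad in [0.3, -0.1, 0.2]:  # made-up gradients
    param, m, v = adam_weight_decay_step(param, grad, m, v)
print(param, m, v)
```

With a zero gradient the Adam part of the update vanishes and the parameter is simply multiplied by (1 - learning_rate * weight_decay_rate), which is the decoupled behaviour the comment in the class describes.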