Question

In Caffe, we have a decay_ratio, which is usually set to 0.0005. All trainable parameters, e.g. the W matrix in FC6, are then decayed by W = W * (1 - 0.0005) after we apply the gradient.
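As a minimal plain-Python sketch of the update described above (this is not Caffe's actual solver code; the function name `sgd_step_with_decay` and the numbers are illustrative assumptions), the decay is a multiplicative shrink applied right after the gradient step:

```python
def sgd_step_with_decay(w, grad, learning_rate, decay_ratio=0.0005):
    """One gradient step followed by the multiplicative decay from the question."""
    # Plain gradient-descent update on each weight.
    stepped = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    # Shrink every weight toward zero: W = W * (1 - decay_ratio).
    return [wi * (1.0 - decay_ratio) for wi in stepped]


# With a zero gradient, only the decay acts on the weights.
w = [1.0, -2.0]
for _ in range(1000):
    w = sgd_step_with_decay(w, grad=[0.0, 0.0], learning_rate=0.1)
```

After 1000 such steps each weight has been scaled by (1 - 0.0005)^1000, which is what keeps the absolute values from growing without bound.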

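I have gone through a lot of TensorFlow tutorial code, but I have not seen how people implement this kind of weight decay to prevent numerical problems (very large absolute values).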

In my experience, I often run into numerical problems at around 100k iterations during training.

I have also looked at related questions on stackoverflow, e.g. How to set weight cost strength in TensorFlow? However, the solution there seems slightly different from the implementation in Caffe.

Does anyone have similar concerns? Thanks.

Answer 1

This is a duplicate question:

How to define weight decay for individual layers in TensorFlow?

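import tensorflow as tf  # written against the TensorFlow 1.x graph API

WEIGHT_DECAY_FACTOR = 0.0005  # example value; pick whatever lambda you want

# Create your variables and register them in a custom 'weights' collection,
# so that tf.get_collection('weights') below can actually find them.
# (The shape is only a placeholder for illustration.)
weights = tf.get_variable(
    'weights', shape=[256, 10], collections=['variables', 'weights'])

with tf.variable_scope('weights_norm') as scope:
    # tf.pack was renamed to tf.stack in TensorFlow 1.0.
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses
# ...

# To calculate your total loss
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')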

You can set whatever lambda value you want for the weight decay. The above just adds the L2 norm of the weights to the loss.
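To make the quantity concrete, here is a rough plain-Python sketch of what the weights_norm term above evaluates to. The helper names `l2_loss` and `weights_norm` are stand-ins written for this illustration (only the half-sum-of-squares definition of tf.nn.l2_loss is taken from TensorFlow), and the numbers are made up:

```python
def l2_loss(values):
    """Half the sum of squares, which is what tf.nn.l2_loss computes for one tensor."""
    return 0.5 * sum(v * v for v in values)


def weights_norm(weight_tensors, weight_decay_factor):
    """Sum of the scaled per-tensor L2 losses (tensors are flat lists here)."""
    return sum(weight_decay_factor * l2_loss(t) for t in weight_tensors)


# Two toy "tensors": l2_loss gives 12.5 and 0.5, so the penalty is 0.0005 * 13.0.
penalty = weights_norm([[3.0, 4.0], [1.0]], weight_decay_factor=0.0005)
```

This penalty is simply added to the data loss, so its effect on the weights depends on how the optimizer turns gradients into updates.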

Answer 2

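The current answer is wrong in that it does not give you proper "weight decay as in cuda-convnet / caffe". What it gives you instead is L2 regularization, which is different.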

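When using pure SGD (without momentum) as the optimizer, weight decay is the same thing as adding an L2 regularization term to the loss. With any other optimizer, this is not the case.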

Weight decay (I don't know how to use TeX here, so please excuse my pseudo-notation):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

L2 regularization:

loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2 regularization gives lambda * w, so plugging it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
       = w[t] - learning_rate * dactual_loss_dw - learning_rate * lambda * w

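gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2 regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)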
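The claim can be checked numerically with a small plain-Python sketch. Everything here is an illustrative assumption: a scalar weight, the toy loss (w - 1)^2, and a momentum rule of the form v = mu * v + grad, w = w - lr * v. With plain SGD the two rules coincide once weight_decay = learning_rate * lambda; with momentum they drift apart from the second step on, because the L2 term gets accumulated in the velocity:

```python
def grad_fn(w):
    """Gradient of the toy loss (w - 1)^2."""
    return 2.0 * (w - 1.0)


def sgd_l2(w, lr, lam, steps):
    for _ in range(steps):
        w = w - lr * (grad_fn(w) + lam * w)  # L2 term goes through the gradient
    return w


def sgd_weight_decay(w, lr, weight_decay, steps):
    for _ in range(steps):
        w = w - lr * grad_fn(w) - weight_decay * w  # decay applied directly
    return w


def momentum_l2(w, lr, lam, mu, steps):
    v = 0.0
    for _ in range(steps):
        v = mu * v + grad_fn(w) + lam * w  # L2 term is accumulated in the velocity
        w = w - lr * v
    return w


def momentum_weight_decay(w, lr, weight_decay, mu, steps):
    v = 0.0
    for _ in range(steps):
        v = mu * v + grad_fn(w)  # velocity only sees the data gradient
        w = w - lr * v - weight_decay * w
    return w


lr, lam, steps = 0.1, 0.5, 2
a = sgd_l2(3.0, lr, lam, steps)
b = sgd_weight_decay(3.0, lr, lr * lam, steps)
c = momentum_l2(3.0, lr, lam, mu=0.9, steps=steps)
d = momentum_weight_decay(3.0, lr, lr * lam, mu=0.9, steps=steps)
```

Here a and b agree up to floating-point rounding, while c and d already differ after two steps.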

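That being said, there does not seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the paper above.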

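One possible way to implement it is to write an op that does the decay step manually after every optimizer step. A different way, which is what I am currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude workarounds, though. My current code: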

# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

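This makes some use of the bookkeeping TensorFlow provides. Note that the arg_scope takes care of appending an L2 regularization term for every layer to the REGULARIZATION_LOSSES graph key. I then sum all of these up and optimize them using SGD with a learning rate of 1.0, which, as shown above, corresponds to actual weight decay.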
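The first option mentioned above (a manual decay step after every optimizer step) could be sketched framework-free as follows. `ToyMomentum` and `step_then_decay` are hypothetical names invented for this illustration, not TensorFlow classes or functions, and the sketch assumes the decay is applied to the already-updated weights:

```python
class ToyMomentum:
    """Tiny stand-in optimizer (hypothetical, not part of TensorFlow)."""

    def __init__(self, learning_rate, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None

    def step(self, weights, grads):
        if self.velocity is None:
            self.velocity = [0.0] * len(weights)
        # Accumulate the data gradient only; no regularization term in here.
        self.velocity = [self.momentum * v + g
                         for v, g in zip(self.velocity, grads)]
        return [w - self.learning_rate * v
                for w, v in zip(weights, self.velocity)]


def step_then_decay(optimizer, weights, grads, weight_decay):
    """Run the optimizer step first, then shrink the updated weights."""
    updated = optimizer.step(weights, grads)
    return [w * (1.0 - weight_decay) for w in updated]


opt = ToyMomentum(learning_rate=0.1)
weights = step_then_decay(opt, [1.0, -2.0], grads=[0.5, 0.0], weight_decay=0.0005)
```

Because the decay never passes through the optimizer's internal state, the same wrapper would work unchanged around any other update rule.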

Hope that helps. If anyone comes up with a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

Edit: see also this PR, which just got merged into TF.

How to implement weight decay in TensorFlow as in Caffe
