Question

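I am writing an optimizer in TensorFlow using Python.

How can I compute a value over the subset of tensor values that belong to the incoming connections of one neuron?

As an example, take a stochastic gradient descent optimizer with a momentum term. The momentum term is computed separately for each connection. I now want to compute the momentum of a connection as the mean of the momentum values of all connections that enter the same neuron.

In this picture you can see two connections entering neuron 3 as incoming connections. For the weight update of one connection, both connections should be taken into account. Normally, the update of connection (1,3) includes only gradient (1,3) and momentum (1,3). To update connection (1,3), I want to use the mean of momentum (1,3) and momentum (2,3).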

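Let's look at a simple fully connected neural network with one input neuron, two hidden layers with two neurons each, and one output neuron:

If we look at the normal computation of the momentum (called "accumulation" in the code) for the weight update of the connection between neuron 2 and neuron 5, only the previous momentum of that same connection is taken into account.

The normal "accumulation" update can be seen in the Python implementation below: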

accumulation = self.get_slot(var, "a")
accumulation_update = grad + (mu_t * accumulation)

For the connection between neuron 2 and neuron 5, the accumulation looks like this:

$accumulationUpdate_{2,5} = grad_{2,5} + (\mu * accumulation_{2,5})$

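This is the part that should be changed. The new momentum computation should take the mean over all connections that enter the same neuron as the connection whose weight update is being computed. In the example network, the "accumulation" value for connection (2,5) is the mean of the "accumulation" values of connections (2,5) and (3,5). These are all the incoming connections of neuron 5.

The "accumulation" update changes as follows: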

accumulation = self.get_slot(var, "a")
accumulation_means = ...  # placeholder: code that calculates the mean values for all neurons
accumulation_update = grad + (mu_t * accumulation_means) # Use the means for the accumulation_update

The accumulation update for connection (2,5) is now computed as follows:

accumulation_mean = (accumulation(2, 5) + accumulation(3, 5)) / 2
accumulation_update(2, 5) = grad(2, 5) + (mu_t * accumulation_mean)
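
As a worked instance of the two lines above, here is a minimal plain-Python sketch. The numbers, the dictionary layout, and the value of `mu_t` are hypothetical and chosen only so that the arithmetic is exact; nothing here is TensorFlow code.

```python
# Hypothetical values for the two incoming connections of neuron 5.
accumulation = {(2, 5): 0.5, (3, 5): 0.25}
grad = {(2, 5): 0.125, (3, 5): -0.25}
mu_t = 0.5  # hypothetical momentum term, chosen so the arithmetic is exact

# Mean of the accumulation over all connections that end in neuron 5.
incoming = [conn for conn in accumulation if conn[1] == 5]
accumulation_mean = sum(accumulation[c] for c in incoming) / len(incoming)

# Every incoming connection uses the same mean, but its own gradient.
accumulation_update = {c: grad[c] + mu_t * accumulation_mean for c in incoming}

print(accumulation_mean)            # 0.375
print(accumulation_update[(2, 5)])  # 0.3125
print(accumulation_update[(3, 5)])  # -0.0625
```

Both connections into neuron 5 share the mean 0.375; only the gradient term differs between them.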

This computation is done in the same way for every connection:

[Figure: calculation for all connections]

Here is the Python implementation of stochastic gradient descent with momentum:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, momentum_term=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = momentum_term

        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        mu_t = math_ops.cast(self._mu_t, var.dtype.base_dtype)
        accumulation = self.get_slot(var, "a")

        accumulation_update = grad + (mu_t * accumulation)
        accumulation_t = state_ops.assign(accumulation, accumulation_update, use_locking=self._use_locking)

        var_update = lr_t * accumulation_t
        var_t = state_ops.assign_sub(var, var_update, use_locking=self._use_locking)

        return control_flow_ops.group(*[var_t, accumulation_t])

    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = ops.convert_to_tensor(self._mu, name="momentum_term")

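The neural network I am testing with (MNIST): https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py

How can the described mean of the "accumulation" values be implemented in the existing MWE code?

Just as a side note:

The MWE is not my real-life scenario. It is only a minimal working example to explain and solve the problem I am trying to tackle.

I am writing the optimizer in Python because I cannot build TensorFlow on Windows and therefore cannot compile the C++ files. I did spend a lot of time trying to build it on Windows and cannot waste more time on it. An optimizer in Python is sufficient for me, since I am only prototyping for now.

I am new to TensorFlow and Python. I could not find anything on this topic in the documentation; a link to a source would be great. The internal structure of tensors is also hard for me to digest, and the error messages I get when trying things out are incomprehensible to me. Please keep this in mind when explaining.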

Answer 1

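Let's take neurons 2, 3, 4, and 5 as an example for computing the new momentum. We ignore the biases and consider only the weights:

We use $\bm{W}$ for the weight matrix, $\bm{G}$ for the corresponding gradients of $\bm{W}$, $\bm{M}$ for the corresponding momentum matrix, and $\tilde{\bm{M}}$ for the matrix of means.

So the update of the new momentum is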
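
The formula itself does not appear in the text. Judging from the code in the question (`accumulation_update = grad + (mu_t * accumulation_means)` followed by `var_update = lr_t * accumulation_t`), it presumably has the following form. The index convention is an assumption: $i$ and $k$ run over the $n$ source neurons and $j$ over the target neurons, so each column of $\bm{M}$ holds the incoming connections of one neuron; $\mu$ stands for `mu_t` and $\eta$ for `lr_t`.

$$\tilde{M}_{i,j} = \frac{1}{n}\sum_{k=1}^{n} M_{k,j}, \qquad \bm{M} \leftarrow \bm{G} + \mu\,\tilde{\bm{M}}, \qquad \bm{W} \leftarrow \bm{W} - \eta\,\bm{M}$$

If the weight matrix is stored the other way round (rows as target neurons), the sum runs along the row instead of the column.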

I changed some code in your proposed SGDmomentum class and ran it on the MNIST example without errors, as I assume you have done as well.

def _apply_dense(self, grad, var):
    lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
    mu_t = math_ops.cast(self._mu_t, var.dtype.base_dtype)
    accumulation = self.get_slot(var, "a")

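    # Number of dimensions of the slot tells us what kind of parameter this is.
    param_dims = len(accumulation.get_shape().as_list())
    if param_dims == 2:  # fc layer weights
        # math_ops is already imported in the optimizer module; tf is not.
        # axis=1 averages along each row. This is the mean over a neuron's incoming
        # connections only if each row holds them; for weights shaped
        # [input_dim, output_dim], as in the linked MNIST example, that axis would be 0.
        accumulation_mean = math_ops.reduce_mean(accumulation, axis=1, keep_dims=True)
    elif param_dims == 1:  # biases: one value per neuron, nothing to average
        accumulation_mean = accumulation
    else:  # cnn? or others
        # TODO: improvement
        accumulation_mean = accumulation

    # The mean keeps a dimension of size 1, so it broadcasts against grad.
    accumulation_update = grad + (mu_t * accumulation_mean)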
    accumulation_t = state_ops.assign(accumulation, accumulation_update, use_locking=self._use_locking)

    var_update = lr_t * accumulation_t
    var_t = state_ops.assign_sub(var, var_update, use_locking=self._use_locking)

    return control_flow_ops.group(*[var_t, accumulation_t])
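
Which axis to average over depends on how the weight matrix is laid out. The sketch below uses NumPy rather than TensorFlow and assumes the layout `[input_dim, output_dim]`, which is how the linked MNIST example shapes its weights. All values, and the choice of `mu_t`, are hypothetical.

```python
import numpy as np

# Hypothetical accumulation for a layer where neurons 2, 3 feed neurons 4, 5.
# Assumed layout [input_dim, output_dim]: rows = source neuron, columns = target neuron.
accumulation = np.array([[1.0, 0.5],    # connections (2,4), (2,5)
                         [3.0, 0.25]])  # connections (3,4), (3,5)
grad = np.array([[0.5, 0.125],
                 [0.5, -0.25]])
mu_t = 0.5  # hypothetical momentum term

# axis=0 averages down each column: the incoming connections of one target neuron.
mean_incoming = accumulation.mean(axis=0, keepdims=True)  # shape (1, 2)
# axis=1 averages along each row: the outgoing connections of one source neuron.
mean_outgoing = accumulation.mean(axis=1, keepdims=True)  # shape (2, 1)

# Broadcasting stretches the (1, 2) means across both rows of grad.
accumulation_update = grad + mu_t * mean_incoming

print(mean_incoming.shape, mean_outgoing.shape)  # (1, 2) (2, 1)
print(accumulation_update[0, 1])                 # 0.3125, the entry for connection (2,5)
```

Under this layout, the mean described in the question corresponds to `axis=0`; with a transposed weight matrix it would be `axis=1`.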

For training:

with tf.name_scope('train'):
    train_step = SGDmomentum(FLAGS.learning_rate, 0.9).minimize(cross_entropy)
    # train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
    #     cross_entropy)

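Currently, on MNIST this algorithm converges more slowly than conventional SGD with momentum.

As for further reading, I don't know whether the Stanford CS231n material on gradient descent and SGD with momentum would help you. You probably know it already.

If you are still confused about the use of the matrix structure of the gradient tensor, just try to accept it, because here there is hardly any difference between a matrix and a single scalar.

All I did here was convert the computation of each accumulationUpdate_* in your question into matrix form.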
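
To illustrate that the matrix form is just the per-connection formula applied to all connections at once, here is a small NumPy sketch (not TensorFlow) on random, hypothetical data. It assumes the layout `[input_dim, output_dim]` and therefore averages over axis 0; with the transposed layout the axis would be 1.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, hypothetical data
accumulation = rng.normal(size=(3, 4))  # assumed layout [input_dim, output_dim]
grad = rng.normal(size=(3, 4))
mu_t = 0.9

# Per-connection form, as written in the question: one scalar update per (i, j).
looped = np.empty_like(accumulation)
for j in range(accumulation.shape[1]):      # target neuron
    mean_j = accumulation[:, j].mean()      # mean over its incoming connections
    for i in range(accumulation.shape[0]):  # source neuron
        looped[i, j] = grad[i, j] + mu_t * mean_j

# Matrix form: the same arithmetic on all connections at once.
vectorized = grad + mu_t * accumulation.mean(axis=0, keepdims=True)

print(np.allclose(looped, vectorized))  # True
```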
