Question

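I tried to train a feedforward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).

In every epoch I iterated over all training samples, performing backpropagation for each sample. The runtime is of course too long.

Is the algorithm I ran called Gradient Descent?

I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.

What should I do to use Stochastic Gradient Descent? Should I just pick training samples at random and perform backpropagation on each randomly picked sample, instead of the epochs I currently use?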

Answer 1

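I'll try to give you some intuition about the problem...

Originally, updates were made with what you (correctly) call (Batch) Gradient Descent. Because each update is computed from the whole training set, every change to the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.

As datasets grew and the computation in each step became more complex, Stochastic Gradient Descent came to be preferred in these cases. Here, the weights are updated as each sample is processed, so subsequent calculations already use "improved" weights. For that very reason, however, a single update can point somewhat away from the direction that minimizes the error function (Fig. 2).

As such, in many cases it is preferable to use Mini-batch Gradient Descent, which combines the best of both worlds: each update to the weights is computed from a small batch of the data. This way, the direction of the updates is somewhat corrected compared with the stochastic updates, while the weights are still updated much more often than in (original) Gradient Descent.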
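To make the trade-off above concrete, here is a minimal sketch in plain Python, not taken from the original answer. It fits a straight line rather than a neural network, and the function name `minibatch_gradient_descent`, the toy data, the learning rate and the epoch count are all assumptions chosen for illustration. The only thing that changes between the three variants is `batch_size`.

```python
import random

def minibatch_gradient_descent(xs, ys, batch_size, lr=0.05, epochs=2000, seed=0):
    """Fit y ~ w*x + b by minimizing the mean squared error.

    batch_size == len(xs) behaves like batch gradient descent,
    batch_size == 1 behaves like stochastic gradient descent,
    anything in between is mini-batch gradient descent.
    Returns (w, b, number_of_weight_updates).
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradient of (w*x + b - y)**2 over the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Exactly one weight update per batch.
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates

# Toy data generated from y = 3x + 1 (no noise); purely illustrative.
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]

for size in (len(xs), 4, 1):
    w, b, updates = minibatch_gradient_descent(xs, ys, batch_size=size)
    print(f"batch_size={size:2d}: w={w:.3f}, b={b:.3f}, updates={updates}")
```

For the same number of epochs, a smaller batch size means more weight updates per pass over the data, each computed from fewer samples.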

[UPDATE] As requested, here is pseudocode for batch gradient descent in binary classification:

error = 0

for sample in data:
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"])  # may be as simple as
                                                                # abs(prediction - sample["label"])
    error += sample_error

neural_network.backpropagate_and_update(error)

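(In the case of multi-class labels, error represents an array of the errors for each label.)

This code is run for a given number of iterations, or for as long as the error stays above a threshold. For stochastic gradient descent, neural_network.backpropagate_and_update() is instead called inside the for loop, with the sample error as its argument.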
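The stochastic variant just described can be sketched as runnable Python. This is only an illustration under stated assumptions: `SingleNeuron` is a hypothetical stand-in for `neural_network` (one sigmoid unit, not a real multi-layer network), `train_sgd`, the toy data, the learning rate and the threshold are made up, and `backpropagate_and_update` takes the features, prediction and label rather than a bare error value, because a working update needs them.

```python
import math
import random

class SingleNeuron:
    """Hypothetical stand-in for neural_network: a single sigmoid unit."""

    def __init__(self, n_inputs):
        self.weights = [0.0] * n_inputs
        self.bias = 0.0

    def predict(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def backpropagate_and_update(self, features, prediction, label, lr):
        # For a sigmoid output with cross-entropy loss, the gradient with
        # respect to the pre-activation is (prediction - label).
        delta = prediction - label
        self.weights = [w - lr * delta * x for w, x in zip(self.weights, features)]
        self.bias -= lr * delta

def train_sgd(network, data, lr=0.5, max_epochs=1000, threshold=0.05, seed=0):
    """Run SGD until the mean absolute error drops below threshold
    or max_epochs is reached. Returns (epochs_run, mean_error)."""
    rng = random.Random(seed)
    data = list(data)
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(data)
        error = 0.0
        for sample in data:
            prediction = network.predict(sample["features"])
            error += abs(prediction - sample["label"])
            # The update happens inside the loop, once per sample.
            network.backpropagate_and_update(
                sample["features"], prediction, sample["label"], lr
            )
        mean_error = error / len(data)
        if mean_error < threshold:
            break
    return epoch, mean_error

# Toy, linearly separable data: the label is 1 when the feature is positive.
data = [
    {"features": [x], "label": 1 if x > 0 else 0}
    for x in (-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0)
]
network = SingleNeuron(n_inputs=1)
epochs_run, mean_error = train_sgd(network, data)
print(epochs_run, round(mean_error, 4))
```

Moving the update call back out of the inner loop, and accumulating the gradient over all samples first, would turn this into the batch version.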

Answer 2

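The new scheme you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

According to that page, the 3 most common flavors are (your flavor is C):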

A)

randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates

B)

for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates

C)

for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
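The three flavors differ only in the order in which sample indices are visited, which can be sketched with Python generators. The function names are made up for this illustration, and drawing with replacement in flavor C is an assumption, since the pseudocode above does not say either way.

```python
import random

def flavor_a(n_samples, epochs, rng):
    """A) Shuffle once, then reuse the same order in every epoch."""
    order = list(range(n_samples))
    rng.shuffle(order)
    for _ in range(epochs):
        yield from order

def flavor_b(n_samples, epochs, rng):
    """B) Reshuffle at the start of every epoch."""
    order = list(range(n_samples))
    for _ in range(epochs):
        rng.shuffle(order)
        yield from order

def flavor_c(n_samples, iterations, rng):
    """C) Draw one random sample per iteration (assumed: with replacement)."""
    for _ in range(iterations):
        yield rng.randrange(n_samples)

rng = random.Random(0)
print(list(flavor_a(4, 2, rng)))  # the same permutation twice
print(list(flavor_b(4, 2, rng)))  # two independently shuffled permutations
print(list(flavor_c(4, 8, rng)))  # indices may repeat or be skipped
```

In a training loop, each yielded index `i` would select the sample on which to compute gradients and perform one weight update. With A and B every sample is seen exactly once per epoch; with C there is no such guarantee.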
