Question

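I have a question about updating theta during stochastic gradient descent (SGD). I have two ways to update theta:

1) Use the previous theta to compute the hypotheses for all samples, then update theta once per sample. Like: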

hypothese = np.dot(X, theta)
for i in range(0, m):
    theta = theta + alpha * (y[i] - hypothese[i]) * X[i]

2) The other way: while scanning through the samples, compute the hypothesis for sample i with the latest theta. Like:

for i in range(0, m):
    h = np.dot(X[i], theta)
    theta = theta + alpha * (y[i] - h) * X[i]

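I checked some SGD code, and it seems the second way is the correct one. But when I ran both, the first converged faster and gave better results than the second. Why does the wrong way perform better than the correct way?

I have also attached the complete code below: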

import numpy as np

# X (shape [m, n]) and y (length m) are assumed to be defined at module level.
def SGD_method1():
    maxIter = 100  # max iterations
    alpha = 1e4  # learning rate
    m, n = np.shape(X)  # X[m,n], m:#samples, n:#features
    theta = np.zeros(n)  # initial theta
    for it in range(0, maxIter):
        hypothese = np.dot(X, theta)  # compute all hypotheses once, using the same theta
        for i in range(0, m):
            theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
    return theta

def SGD_method2():
    maxIter = 100  # max iterations
    alpha = 1e4  # learning rate
    m, n = np.shape(X)  # X[m,n], m:#samples, n:#features
    theta = np.zeros(n)  # initial theta
    for it in range(0, maxIter):
        for i in range(0, m):
            h = np.dot(X[i], theta)  # compute one hypothesis using the latest theta
            theta = theta + alpha * (y[i] - h) * X[i]
    return theta

Answer 1

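The first code is not SGD. It is "traditional" (batch) gradient descent. The stochasticity comes from basing each update on the gradient computed for a single sample (or for a small batch, which is called mini-batch SGD). That is obviously not the true gradient of the error function (which is the sum of the errors over all training samples), but it can be shown that, under reasonable conditions, such a procedure converges to a local optimum. Stochastic updates are preferred in many applications because they are simple and (in many cases) cheaper to compute. Both algorithms are correct (both are guaranteed, under reasonable assumptions, to converge to a local optimum), and the choice of strategy depends on the particular problem (especially its size and other requirements).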
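To see why the first code is batch gradient descent: the hypotheses are frozen before the inner loop runs, so the m per-sample updates simply add up to one step along the summed gradient, theta + alpha * X.T @ (y - X @ theta). The following is a minimal sketch of that check, not the asker's actual setup. The synthetic data, the step size of 1e-3, and the helper names `method1_pass`, `batch_gd_step`, and `method2_pass` are all assumptions made for illustration.

```python
import numpy as np

# Synthetic regression data (assumed for illustration; not from the original post).
rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=m)

alpha = 1e-3  # small step size chosen for this toy data (an assumption)
theta0 = np.zeros(n)


def method1_pass(theta, X, y, alpha):
    """One pass of method 1: hypotheses are computed once from the old theta."""
    hypothese = X @ theta
    for i in range(len(y)):
        theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
    return theta


def batch_gd_step(theta, X, y, alpha):
    """One batch gradient descent step on the summed squared error."""
    return theta + alpha * X.T @ (y - X @ theta)


def method2_pass(theta, X, y, alpha):
    """One pass of method 2 (SGD): each update uses the latest theta."""
    for i in range(len(y)):
        h = X[i] @ theta
        theta = theta + alpha * (y[i] - h) * X[i]
    return theta


t1 = method1_pass(theta0, X, y, alpha)
tb = batch_gd_step(theta0, X, y, alpha)
t2 = method2_pass(theta0, X, y, alpha)

# Method 1 matches the single batch step up to floating-point rounding;
# method 2 follows a different trajectory because theta changes between samples.
same_as_batch = np.allclose(t1, tb)
differs_from_sgd = not np.allclose(t1, t2)
print(same_as_batch, differs_from_sgd)
```

Because method 1 takes one exact full-gradient step per pass, it can look smoother and converge faster on a small dataset, whereas per-sample SGD usually pays off when m is large and a full pass is expensive.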

What is the difference between these two ways of updating the hypothesis during stochastic gradient descent?
