Question

I am working on a comparison of popular gradient descent algorithms in Python. Here is a link to the notebook I have been working on.

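Adagrad converges much more slowly than the vanilla batch, stochastic, and mini-batch algorithms. I expected it to be an improvement over the basic methods. Is the difference due to one or more of the factors below, to something else, or is this the expected result?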

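The test dataset is small, and Adagrad performs relatively better on larger datasets
Something to do with the characteristics of the sample data
Something to do with the parameters
A bug in the code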
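One way to probe the parameter hypothesis is to look at the effective per-coordinate step size that Adagrad applies, alpha / (eps + sqrt(sum of squared gradients)). The sketch below is only an illustration under assumed inputs: the helper name `adagrad_effective_steps` and the constant unit gradient are hypothetical and do not come from the notebook. It shows that the effective step can only shrink as gradients accumulate, so a learning rate that suits plain gradient descent may be too small for Adagrad.

```python
import numpy as np

def adagrad_effective_steps(gradients, alpha, eps=1e-6):
    """Return alpha / (eps + sqrt(running sum of g^2)) after each gradient.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    grad_hist = np.zeros_like(gradients[0], dtype=float)
    steps = []
    for g in gradients:
        grad_hist += np.square(g)                         # accumulate squared gradients
        steps.append(alpha / (eps + np.sqrt(grad_hist)))  # current effective step size
    return np.array(steps)

# Assumed toy input: a constant gradient of magnitude 1 in a single coordinate
grads = [np.array([1.0])] * 100
steps = adagrad_effective_steps(grads, alpha=0.1)
print(steps[0, 0], steps[-1, 0])
```

With a constant gradient the effective step falls off roughly as alpha / sqrt(t), which is the behavior to keep in mind when choosing alpha for the comparison.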

Below is the code for Adagrad - it is also the last code cell in the notebook.

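import numpy as np

def gd_adagrad(data, alpha, num_iter, b=1):
    # data: (m, N) array whose last column is the target y
    m, N = data.shape
    # Prepend a column of ones for the intercept term
    Xy = np.ones((m,N+1))
    Xy[:,1:] = data
    theta = np.ones(N)
    # Running sum of squared gradients (broadcasts to shape (N,))
    grad_hist = 0
    for i in range(num_iter):
        # Reshuffle rows each epoch, then split into mini-batches of size b
        np.random.shuffle(Xy)
        batches = np.split(Xy, np.arange(b, m, b))
        for B_x, B_y in ((B[:,:-1],B[:,-1]) for B in batches):
            # Residuals of the current batch (prediction minus target)
            loss_B = B_x.dot(theta) - B_y
            gradient = B_x.T.dot(loss_B) / B_x.shape[0]
            grad_hist += np.square(gradient)
            # Adagrad update: each coordinate's step is scaled by its accumulated gradient magnitude
            theta = theta - alpha * gradient / (10**-6 + np.sqrt(grad_hist))
    return theta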

theta = gd_adagrad(data_norm, alpha*10, 150, 50)
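To check whether the slow convergence depends on the learning rate rather than on a bug, one option is to rerun the same update rule for several values of alpha and record the error after each epoch. The sketch below is a minimal illustration, not the notebook's code: `adagrad_mse_history`, the `seed` argument, and the synthetic `toy` dataset (standing in for `data_norm`) are all assumptions introduced here.

```python
import numpy as np

def adagrad_mse_history(data, alpha, num_iter, b=1, eps=1e-6, seed=0):
    """Apply the same update rule as gd_adagrad and return the full-data MSE per epoch.

    Hypothetical diagnostic helper; data is (m, N) with the target in the last column.
    """
    rng = np.random.default_rng(seed)
    m, N = data.shape
    Xy = np.ones((m, N + 1))          # column 0 is the intercept term
    Xy[:, 1:] = data
    theta = np.ones(N)
    grad_hist = np.zeros(N)
    history = []
    for _ in range(num_iter):
        rng.shuffle(Xy)               # shuffle rows in place each epoch
        for B in np.split(Xy, np.arange(b, m, b)):
            B_x, B_y = B[:, :-1], B[:, -1]
            gradient = B_x.T.dot(B_x.dot(theta) - B_y) / B_x.shape[0]
            grad_hist += np.square(gradient)
            theta = theta - alpha * gradient / (eps + np.sqrt(grad_hist))
        residual = Xy[:, :-1].dot(theta) - Xy[:, -1]
        history.append(np.mean(residual ** 2))
    return np.array(history)

# Synthetic stand-in for data_norm (an assumption): y = 3*x + small noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)
toy = np.column_stack([x, y])

final_mse = {}
for alpha in (0.01, 0.1, 1.0):
    mse = adagrad_mse_history(toy, alpha, num_iter=150, b=50)
    final_mse[alpha] = mse[-1]
    print(f"alpha={alpha}: final MSE={mse[-1]:.4f}")
```

If the final error changes sharply across these values, the learning rate is a likely contributor. If it stays poor for every alpha, the data or the code deserve a closer look.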

Slow Adagrad convergence

0 Answers: