Question

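I started learning machine learning last week. When I tried to write a gradient descent script to estimate model parameters, I ran into a problem: how to choose an appropriate learning rate and variance. I found that different (learning rate, variance) pairs can lead to different results, and with some of them the algorithm does not converge at all. Also, a (learning rate, variance) pair that works well for one training data set may not work for another. For example (script below), when I set the learning rate to 0.001 and the variance to 0.00001, I get suitable theta0_guess and theta1_guess for 'data1'. For 'data2', however, the same values do not make the algorithm converge, and even after trying dozens of (learning rate, variance) pairs I still could not reach convergence.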

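So I would appreciate it if someone could tell me whether there are criteria or methods for determining the (learning rate, variance) pair.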

import sys

data1 = [(0.000000,95.364693) ,
    (1.000000,97.217205) ,
    (2.000000,75.195834),
    (3.000000,60.105519) ,
    (4.000000,49.342380),
    (5.000000,37.400286),
    (6.000000,51.057128),
    (7.000000,25.500619),
    (8.000000,5.259608),
    (9.000000,0.639151),
    (10.000000,-9.409936),
    (11.000000, -4.383926),
    (12.000000,-22.858197),
    (13.000000,-37.758333),
    (14.000000,-45.606221)]

data2 = [(2104.,400.),
     (1600.,330.),
     (2400.,369.),
     (1416.,232.),
     (3000.,540.)]

def create_hypothesis(theta1, theta0):
    return lambda x: theta1*x + theta0

def linear_regression(data, learning_rate=0.001, variance=0.00001):


    theta0_guess = 1.
    theta1_guess = 1.


    theta0_last = 100.
    theta1_last = 100.

    m = len(data)

    while (abs(theta1_guess-theta1_last) > variance or abs(theta0_guess - theta0_last) > variance):

        theta1_last = theta1_guess
        theta0_last = theta0_guess

        hypothesis = create_hypothesis(theta1_guess, theta0_guess)

        theta0_guess = theta0_guess - learning_rate * (1./m) * sum([hypothesis(point[0]) - point[1] for point in data])
        theta1_guess = theta1_guess - learning_rate * (1./m) * sum([ (hypothesis(point[0]) - point[1]) * point[0] for point in data])   

    return ( theta0_guess,theta1_guess )



points = [(float(x),float(y)) for (x,y) in data1]

res = linear_regression(points)
print(res)

Answer 1

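Plotting is the best way to see how your algorithm is performing. To check whether you have reached convergence, you can plot the evolution of the cost function after each iteration; once it stops improving noticeably after a certain number of iterations, you can assume convergence. Take a look at the following code: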

# This fragment replaces the while loop inside linear_regression.
cost_f = []
while (abs(theta1_guess-theta1_last) > variance or abs(theta0_guess - theta0_last) > variance):

    theta1_last = theta1_guess
    theta0_last = theta0_guess

    hypothesis = create_hypothesis(theta1_guess, theta0_guess)
    cost_f.append((1./(2*m))*sum([ pow(hypothesis(point[0]) - point[1], 2) for point in data]))

    theta0_guess = theta0_guess - learning_rate * (1./m) * sum([hypothesis(point[0]) - point[1] for point in data])
    theta1_guess = theta1_guess - learning_rate * (1./m) * sum([ (hypothesis(point[0]) - point[1]) * point[0] for point in data])   

# After the loop: plot the cost recorded at each iteration.
import pylab
pylab.plot(range(len(cost_f)), cost_f)
pylab.show()

This produces the following plot (run with learning_rate = 0.01, variance = 0.00001):

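As you can see, after about a thousand iterations there is not much further improvement. I usually declare convergence when the cost function decreases by less than 0.001 in one iteration, but that is only based on my own experience.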
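The rule of thumb above can be sketched as a stopping test on the cost itself rather than on the parameter change (which is what `variance` controls in the question's script). This is only a minimal sketch: the names `tol` and `max_iters` are my own, the sample is just the first five points of data1, and it stops both when the cost barely decreases and when the cost goes up.

```python
def cost(data, theta0, theta1):
    """Half mean squared error, the same quantity stored in cost_f."""
    m = len(data)
    return sum((theta1 * x + theta0 - y) ** 2 for x, y in data) / (2.0 * m)


def gradient_descent(data, learning_rate=0.01, tol=0.001, max_iters=100000):
    """Batch gradient descent that stops once the cost drops by less than tol.

    tol and max_iters are hypothetical names; max_iters is a safety cap
    so the loop always terminates.
    """
    m = float(len(data))
    theta0, theta1 = 1.0, 1.0
    history = [cost(data, theta0, theta1)]
    for _ in range(max_iters):
        errors = [theta1 * x + theta0 - y for x, y in data]
        grad0 = sum(errors) / m
        grad1 = sum(e * p[0] for e, p in zip(errors, data)) / m
        # Simultaneous update of both parameters.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
        history.append(cost(data, theta0, theta1))
        decrease = history[-2] - history[-1]
        if decrease < tol:  # also stops if the cost went up (decrease < 0)
            break
    return theta0, theta1, history


# First five points of data1 from the question, to keep the sketch short.
sample = [(0.0, 95.364693), (1.0, 97.217205), (2.0, 75.195834),
          (3.0, 60.105519), (4.0, 49.342380)]
theta0, theta1, history = gradient_descent(sample)
print(len(history) - 1, theta0, theta1, history[-1])
```

The returned `history` list can be passed to `pylab.plot` exactly like `cost_f`.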

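To choose the learning rate, the best thing you can do is also plot the cost function and see how it behaves, always keeping these two things in mind:

If the learning rate is too small, convergence will be slow.
If the learning rate is too large, the cost function may not decrease at every iteration, so the algorithm will not converge.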

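If you run your code with learning_rate > 0.029 and variance = 0.001, you will be in the second case and gradient descent will not converge, whereas if you choose learning_rate < 0.0001 and variance = 0.001, you will see that the algorithm needs a very large number of iterations to converge.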

Example of non-convergence with learning_rate = 0.03

Example of slow convergence with learning_rate = 0.0001
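To reproduce the two cases without a plot, one option is to run a fixed number of iterations and compare the first and last cost. This is a rough sketch, not the answer's original code: `cost_history` and its `iters` parameter are names I made up, and 500 iterations is an arbitrary budget.

```python
data1 = [(0.0, 95.364693), (1.0, 97.217205), (2.0, 75.195834),
         (3.0, 60.105519), (4.0, 49.342380), (5.0, 37.400286),
         (6.0, 51.057128), (7.0, 25.500619), (8.0, 5.259608),
         (9.0, 0.639151), (10.0, -9.409936), (11.0, -4.383926),
         (12.0, -22.858197), (13.0, -37.758333), (14.0, -45.606221)]


def cost_history(data, learning_rate, iters):
    """Return the cost at iterations 0..iters of batch gradient descent.

    iters is a hypothetical parameter: a fixed budget keeps the loop
    finite even when the learning rate is too large.
    """
    m = float(len(data))
    theta0, theta1 = 1.0, 1.0
    costs = []
    for _ in range(iters + 1):
        errors = [theta1 * x + theta0 - y for x, y in data]
        costs.append(sum(e * e for e in errors) / (2.0 * m))
        grad0 = sum(errors) / m
        grad1 = sum(e * p[0] for e, p in zip(errors, data)) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return costs


for rate in (0.03, 0.01, 0.0001):
    costs = cost_history(data1, rate, 500)
    trend = "went up" if costs[-1] > costs[0] else "went down"
    print(rate, trend, costs[0], costs[-1])
```

A cost that ends higher than it started points to a learning rate that is too large; a cost that goes down but stays far above what a moderate rate reaches in the same number of iterations points to a rate that is too small.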

Answer 2

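There are many ways to guarantee the convergence of gradient descent. There is line search, a fixed step size related to the Lipschitz constant of the gradient (that is for the case of a function; for a table like yours, you can take differences between consecutive values), decreasing the step size at each iteration, and some others. Some of them can be found here.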
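For the least-squares cost used in the question, the Lipschitz constant of the gradient can be written down directly, so here is a minimal sketch of the fixed-step idea under that assumption. The cost J(theta0, theta1) = 1/(2m) * sum((theta0 + theta1*x - y)^2) has the constant Hessian [[1, mean(x)], [mean(x), mean(x^2)]], and its largest eigenvalue L is the Lipschitz constant. For a quadratic cost like this, any fixed step below 2/L converges, and 1/L is a conservative choice. The function name `lipschitz_constant` is my own.

```python
import math


def lipschitz_constant(data):
    """Largest eigenvalue of the Hessian [[1, mean(x)], [mean(x), mean(x^2)]].

    Assumes the half mean squared error cost of a straight-line fit, as in
    the question's script. Only the x values matter.
    """
    m = float(len(data))
    mean_x = sum(x for x, _ in data) / m
    mean_x2 = sum(x * x for x, _ in data) / m
    trace = 1.0 + mean_x2
    det = mean_x2 - mean_x * mean_x
    # Closed-form largest eigenvalue of a symmetric 2x2 matrix.
    return 0.5 * (trace + math.sqrt(trace * trace - 4.0 * det))


# x values of data1 (0..14); the y values do not affect the constant.
data1_x = [(float(x), 0.0) for x in range(15)]
data2 = [(2104., 400.), (1600., 330.), (2400., 369.),
         (1416., 232.), (3000., 540.)]

for name, data in (("data1", data1_x), ("data2", data2)):
    L = lipschitz_constant(data)
    print(name, "L =", L, "safe step 1/L =", 1.0 / L, "upper limit 2/L =", 2.0 / L)
```

For data1 the upper limit 2/L comes out at roughly 0.029, which agrees with the threshold reported in Answer 1. For data2 the x values are in the thousands, so the limit is on the order of 4e-7, which would explain why a learning rate of 0.001 cannot converge there. Even with a safe step, data2 may still need a very large number of iterations, because the two eigenvalues of its Hessian differ by many orders of magnitude.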

How to determine the learning rate and variance in a gradient descent algorithm?
