Sklearn Lasso regression is orders of magnitude worse than Ridge regression?

Time: 2016-03-01 04:36:24

Tags: python machine-learning scikit-learn linear-regression

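I have implemented Ridge and Lasso regression using the sklearn.linear_model module.

However, Lasso regression seems to do 3 orders of magnitude worse on the same dataset!

I'm not sure what's wrong, because mathematically this shouldn't happen. Here is my code: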

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older releases

def ridge_regression(X_train, Y_train, X_test, Y_test, model_alpha):
    # Fit Ridge (L2 penalty) and return the sum of squared errors on the test set
    clf = linear_model.Ridge(alpha=model_alpha)
    clf.fit(X_train, Y_train)
    predictions = clf.predict(X_test)
    loss = np.sum((predictions - Y_test)**2)
    return loss

def lasso_regression(X_train, Y_train, X_test, Y_test, model_alpha):
    # Fit Lasso (L1 penalty) and return the sum of squared errors on the test set
    clf = linear_model.Lasso(alpha=model_alpha)
    clf.fit(X_train, Y_train)
    predictions = clf.predict(X_test)
    loss = np.sum((predictions - Y_test)**2)
    return loss


# X and Y are the feature matrix and targets (loading code not shown in the question)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)
for alpha in [0, 0.01, 0.1, 0.5, 1, 2, 5, 10, 100, 1000, 10000]:
    print("Lasso loss for alpha=" + str(alpha) + ": " + str(lasso_regression(X_train, Y_train, X_test, Y_test, alpha)))

for alpha in [1, 1.25, 1.5, 1.75, 2, 5, 10, 100, 1000, 10000, 100000, 1000000]:
    print("Ridge loss for alpha=" + str(alpha) + ": " + str(ridge_regression(X_train, Y_train, X_test, Y_test, alpha)))

Here is my output:

Lasso loss for alpha=0: 20575.7121727
Lasso loss for alpha=0.01: 19762.8763969
Lasso loss for alpha=0.1: 17656.9926418
Lasso loss for alpha=0.5: 15699.2014387
Lasso loss for alpha=1: 15619.9772649
Lasso loss for alpha=2: 15490.0433166
Lasso loss for alpha=5: 15328.4303197
Lasso loss for alpha=10: 15328.4303197
Lasso loss for alpha=100: 15328.4303197
Lasso loss for alpha=1000: 15328.4303197
Lasso loss for alpha=10000: 15328.4303197
Ridge loss for alpha=1: 61.6235890425
Ridge loss for alpha=1.25: 61.6360790934
Ridge loss for alpha=1.5: 61.6496312133
Ridge loss for alpha=1.75: 61.6636076713
Ridge loss for alpha=2: 61.6776331539
Ridge loss for alpha=5: 61.8206621527
Ridge loss for alpha=10: 61.9883144732
Ridge loss for alpha=100: 63.9106882674
Ridge loss for alpha=1000: 69.3266510866
Ridge loss for alpha=10000: 82.0056669678
Ridge loss for alpha=100000: 88.4479064159
Ridge loss for alpha=1000000: 91.7235727543

Any idea why?

Thanks!

1 answer:

Answer 0: (score: 1)

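Interesting question. I can confirm that this is not a problem with the algorithm's implementation, but rather the correct response to your input.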

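Here's a thought: from your description, I believe you have not normalized your data. That can cause instability, because your features have significantly different orders of magnitude and variances. Lasso is more "all or nothing" than ridge (you may have noticed that it sets more coefficients to 0 than ridge does), so the instability gets amplified.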
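To make the "all or nothing" point concrete, here is a minimal sketch on synthetic data (not the dataset from the question). The sample size, the number of features, the true coefficients, and alpha=0.1 are all assumptions picked only for illustration. It counts how many fitted coefficients are exactly zero for each model.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): only 3 of 20 features matter.
rng = np.random.RandomState(0)
n_samples, n_features = 200, 20
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.randn(n_samples)

# Same alpha value for both models; note the penalties are scaled differently
# inside scikit-learn, so this is not a like-for-like regularization strength.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

On data like this, Lasso should drop most of the irrelevant features entirely, while Ridge only shrinks them toward zero.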

Try normalizing your data and see whether you like the results better.

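Another thought: this may be intentional on the part of your Berkeley instructors, to highlight the fundamentally different behavior of ridge and lasso.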