How is cv_values_ computed in sklearn.linear_model.RidgeCV?

Date: 2016-06-10 16:18:26

Tags: python scikit-learn

A reproducible example to anchor the discussion:

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale

# standardized features and raw target of the Boston housing data
boston = scale(load_boston().data)
target = load_boston().target

# leave-one-out (generalized) CV over five alphas, keeping the per-sample errors
alphas = np.linspace(1.0, 200.0, 5)
fit0 = RidgeCV(alphas=alphas, store_cv_values=True, gcv_mode='eigen').fit(boston, target)
fit0.alpha_
fit0.cv_values_[:,0]

The question: what formula is used to compute fit0.cv_values_?

Edit:

@Abhinav Arora's answer below seems to suggest that fit0.cv_values_[:,0][0], the first entry of fit0.cv_values_[:,0], would be

(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2

where fit1 is a ridge regression with alpha = 1.0, fitted to the dataset from which observation 0 was removed.

Let's see:

1) create a new dataset with the first row of the original dataset removed:

from sklearn.linear_model import Ridge
boston1 = np.delete(boston, (0), axis=0)
target1 = np.delete(target, (0), axis=0)

2) fit a ridge model with alpha = 1.0 on this truncated dataset:

fit1 = Ridge(alpha=1.0).fit(boston1, target1)

3) check the squared error of that model on the first data point:

(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2

This gives array([ 37.64650853]), which is not the same as the first entry produced by fit0.cv_values_[:,0], namely:

fit0.cv_values_[:,0][0]

which is 37.495629960571137
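To see whether the mismatch is specific to observation 0, the same three steps can be repeated for every observation and the whole vector compared at once. This is just a brute-force sketch of that check (loo_sq_errors is a made-up name; it reuses boston, target and fit0 from above):

loo_sq_errors = np.empty(len(target))
for i in range(len(target)):
    # steps 1) and 2): drop observation i and refit with alpha = 1.0
    model = Ridge(alpha=1.0).fit(np.delete(boston, i, axis=0),
                                 np.delete(target, i, axis=0))
    # step 3): squared error on the held-out observation
    loo_sq_errors[i] = (model.predict(boston[i].reshape(1, -1))[0] - target[i]) ** 2

# largest absolute gap between the manual LOO errors and the stored values for alpha = 1.0
print(np.abs(loo_sq_errors - fit0.cv_values_[:, 0]).max())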

What gives?

2 Answers:

Answer 0 (score: 3)

Quoting from the sklearn documentation:

Cross-validation values for each alpha (if store_cv_values=True and cv=None). After fit() has been called, this attribute will contain the mean squared errors (by default) or the values of the {loss,score}_func function (if provided in the constructor).

Since you did not provide any scoring function in the constructor, and did not provide anything for its cv argument either, this attribute should store the mean squared error for each sample, computed with leave-one-out cross-validation. The general formula for mean squared error is

MSE = (1/n) * Σᵢ (Ŷᵢ - Yᵢ)²

where Ŷ (Y with a hat) is the regressor's prediction and the other Y is the true value.

In your case you are doing leave-one-out cross-validation, so in every fold you have only 1 test point, and thus n = 1. Doing fit0.cv_values_[:,0] therefore simply gives you the squared error for every point of your training dataset, measured when that point was in the test fold and the value of alpha was 1.0.
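As a quick sanity check of this reading (a sketch reusing fit0 and alphas from the question), cv_values_ should have one row per sample and one column per alpha, and alpha_ should be the alpha whose column has the smallest mean, if I am reading the selection rule correctly:

print(fit0.cv_values_.shape)   # (n_samples, len(alphas)), i.e. (506, 5) here
print(fit0.alpha_ == alphas[np.argmin(fit0.cv_values_.mean(axis=0))])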

Hope that helps.

Answer 1 (score: 2)

Let's look - it's open source after all.

The first call to fit dispatches up to its parent, _BaseRidgeCV (line 997 in that implementation). We haven't provided a cross-validation generator, so we make another call, to _RidgeGCV.fit. There is a lot of math in that function's documentation, but we are so close to the source that I will let you go and read it.

Here is the actual source:

    v, Q, QT_y = _pre_compute(X, y)
    n_y = 1 if len(y.shape) == 1 else y.shape[1]
    cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
    C = []

    scorer = check_scoring(self, scoring=self.scoring, allow_none=True)
    error = scorer is None

    for i, alpha in enumerate(self.alphas):
        weighted_alpha = (sample_weight * alpha
                          if sample_weight is not None
                          else alpha)
        if error:
            out, c = _errors(weighted_alpha, y, v, Q, QT_y)
        else:
            out, c = _values(weighted_alpha, y, v, Q, QT_y)
        cv_values[:, i] = out.ravel()
        C.append(c)

Note the exciting _pre_compute function:

def _pre_compute(self, X, y):
    # even if X is very sparse, K is usually very dense
    K = safe_sparse_dot(X, X.T, dense_output=True)
    v, Q = linalg.eigh(K)
    QT_y = np.dot(Q.T, y)
    return v, Q, QT_y
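For concreteness, the eigen path can be condensed into a few lines. This is only my own sketch, not sklearn's code: it assumes the standard dual-form leave-one-out identity for ridge, e_i = c_i / G_ii with G = (K + alpha*I)^(-1) and c = G y, and loo_errors_eigen is a made-up name:

import numpy as np
from scipy import linalg

def loo_errors_eigen(X, y, alpha):
    # K = X X^T = Q diag(v) Q^T, so G = (K + alpha*I)^{-1} = Q diag(1/(v + alpha)) Q^T
    K = X.dot(X.T)
    v, Q = linalg.eigh(K)
    w = 1.0 / (v + alpha)
    c = Q.dot(w * Q.T.dot(y))    # c = G y
    G_diag = (Q ** 2).dot(w)     # diagonal of G, without forming the full matrix
    return (c / G_diag) ** 2     # squared LOO residual for every sample

One caveat when comparing this with the manual refit in the question: _RidgeGCV centers X and y up front when fit_intercept=True (the default), whereas a true refit-per-fold LOO would recompute those centering statistics without the held-out point, which may account for small numerical gaps like the one observed.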

Abhinav has already explained what is going on at the mathematical level - it is simply accumulating the weighted mean squared error. The details of the implementation, and where it differs from yours, can be evaluated step by step from the code.