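Error on the first cross-validation split is higher than on the remaining splits

Time: 2018-03-28 20:04:57

Tags: python scikit-learn regression cross-validation

I am trying to evaluate different regression techniques with 5-fold cross-validation, using the following code: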

import numpy as np
from sklearn.linear_model import Ridge, MultiTaskLasso as Lasso, ElasticNet as Elastic
from sklearn.model_selection import KFold

# x_bow (features) and y_np (targets: points, price) are defined elsewhere
classifiers = [Ridge, Lasso, Elastic]

kf = KFold(n_splits=5)

for classifier in classifiers:
    name = classifier.__name__

    # kf.split() returns a one-shot generator, so call it once per model
    for i, (train_idx, test_idx) in enumerate(kf.split(x_bow)):
        clf = classifier(alpha=1)  # fresh, unfitted estimator for every fold

        x_train_split = x_bow[train_idx,:]
        y_train_split = y_np[train_idx,:]
        x_test_split = x_bow[test_idx,:]
        y_test_split = y_np[test_idx,:]

        clf.fit(x_train_split, y_train_split)
        prediction = clf.predict(x_test_split)
        # axis=0 averages over samples, leaving one MAE per target column
        mae = np.mean(np.abs(prediction - y_test_split), axis=0)
        print(f'{name} - split {i+1} - points mae {mae[0]:.2f} price mae {mae[1]:.2f}')

This produces the following output:

Ridge - split 1 - points mae 3.22 price mae 1.71
Ridge - split 2 - points mae 0.47 price mae 0.41
Ridge - split 3 - points mae 0.23 price mae 0.11
Ridge - split 4 - points mae 0.11 price mae 0.20
Ridge - split 5 - points mae 0.36 price mae 0.67
MultiTaskLasso - split 1 - points mae 4.09 price mae 2.37
MultiTaskLasso - split 2 - points mae 0.26 price mae 0.20
MultiTaskLasso - split 3 - points mae 0.48 price mae 0.36
MultiTaskLasso - split 4 - points mae 0.39 price mae 0.28
MultiTaskLasso - split 5 - points mae 0.45 price mae 0.73
ElasticNet - split 1 - points mae 4.09 price mae 2.37
ElasticNet - split 2 - points mae 0.26 price mae 0.20
ElasticNet - split 3 - points mae 0.48 price mae 0.36
ElasticNet - split 4 - points mae 0.39 price mae 0.28
ElasticNet - split 5 - points mae 0.45 price mae 0.73

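Looking at the output, I suspect the classifiers get a lower error after the first split because they are being evaluated on splits they have already learned from. However, I do create a new classifier inside the for loop, so each iteration should get a fresh object. (So the first split should not affect the others.)

My question is: why is the first error higher than the others, and how can I fix this?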

1 Answer:

Answer 0: (score: 0)

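The data was not shuffled, and the outliers are at the beginning of the dataset. KFold defaults to shuffle=False and cuts the rows into consecutive blocks, so the first test fold is the one that contains those outliers.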
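
A minimal sketch of the effect, not the asker's actual setup: the data below is synthetic, and the names X, y and fold_maes are made up for illustration. The first 20 of 200 rows get a large offset to play the role of outliers, and a Ridge model is scored once with the default KFold and once with shuffle=True.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data (an assumption): 200 rows, with the
# first 20 targets shifted far away so that they behave like outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
y[:20] += 50.0


def fold_maes(kf):
    """Return the mean absolute error of a fresh Ridge model on each fold."""
    maes = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=1).fit(X[train_idx], y[train_idx])
        errors = np.abs(model.predict(X[test_idx]) - y[test_idx])
        maes.append(np.mean(errors))
    return maes


# Default KFold (shuffle=False): fold 1 tests on rows 0..39, i.e. every outlier.
unshuffled = fold_maes(KFold(n_splits=5))

# shuffle=True spreads the outliers over the folds; random_state fixes the order.
shuffled = fold_maes(KFold(n_splits=5, shuffle=True, random_state=0))

print("unshuffled:", np.round(unshuffled, 2))
print("shuffled:  ", np.round(shuffled, 2))
```

Shuffling does not remove the outliers. It only distributes them across the folds so that the per-fold scores become comparable. If the rows have a meaningful order (for example a time series), shuffling may not be appropriate, and the outliers would need to be handled directly.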