I am trying to evaluate different regression techniques with 5-fold cross-validation, using the following code:
import numpy as np
from sklearn.linear_model import Ridge, MultiTaskLasso as Lasso, ElasticNet as Elastic
from sklearn.model_selection import KFold

# x_bow (features) and y_np (targets) are defined elsewhere
classifiers = [Ridge, Lasso, Elastic]
kf = KFold(n_splits=5)
for classifier in classifiers:
    name = classifier.__name__
    # kf.split returns a one-shot generator, so request it anew for each model
    for i, (train_idx, test_idx) in enumerate(kf.split(x_bow)):
        clf = classifier(alpha=1)  # fresh, unfitted model for every split
        x_train_split = x_bow[train_idx,:]
        y_train_split = y_np[train_idx,:]
        x_test_split = x_bow[test_idx,:]
        y_test_split = y_np[test_idx,:]
        clf.fit(x_train_split, y_train_split)
        prediction = clf.predict(x_test_split)
        mae = np.mean(np.abs(prediction - y_test_split), axis=1)
        print(f'{name} - split {i+1} - points mae {mae[0]:.2f} price mae {mae[1]:.2f}')
This produces the following output:
Ridge - split 1 - points mae 3.22 price mae 1.71
Ridge - split 2 - points mae 0.47 price mae 0.41
Ridge - split 3 - points mae 0.23 price mae 0.11
Ridge - split 4 - points mae 0.11 price mae 0.20
Ridge - split 5 - points mae 0.36 price mae 0.67
MultiTaskLasso - split 1 - points mae 4.09 price mae 2.37
MultiTaskLasso - split 2 - points mae 0.26 price mae 0.20
MultiTaskLasso - split 3 - points mae 0.48 price mae 0.36
MultiTaskLasso - split 4 - points mae 0.39 price mae 0.28
MultiTaskLasso - split 5 - points mae 0.45 price mae 0.73
ElasticNet - split 1 - points mae 4.09 price mae 2.37
ElasticNet - split 2 - points mae 0.26 price mae 0.20
ElasticNet - split 3 - points mae 0.48 price mae 0.36
ElasticNet - split 4 - points mae 0.39 price mae 0.28
ElasticNet - split 5 - points mae 0.45 price mae 0.73
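Looking at the output, I suspected that the models get lower errors after the first split because they are being evaluated on splits they have already learned from. However, I do create a new model inside the for loop, so every split should get a fresh object, and the first split should not affect the others.
My question is: why is the first error higher than the others, and how can I fix this?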
Answer 0 (score: 0)
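The data was not shuffled, and the outliers sit at the beginning of the dataset. KFold defaults to shuffle=False and cuts the rows into consecutive blocks, so the first test fold is exactly the first fifth of the data and contains those outliers. Passing shuffle=True to KFold spreads them across the folds.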
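
A minimal sketch of that effect, assuming synthetic stand-in data rather than the asker's x_bow and y_np (the arrays X and y, the outlier offset, and the helper per_fold_mae are all made up for illustration). It compares unshuffled and shuffled KFold on a dataset whose first rows are outliers. It also averages with axis=0, on the assumption that one MAE per target column was intended; with axis=1, as in the question, mae[0] and mae[1] would be the errors of the first two test samples rather than of the two targets.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the asker's x_bow / y_np):
# 100 samples, 5 features, 2 targets, with outliers in the first 10 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W = rng.normal(size=(5, 2))
y = X @ W + rng.normal(scale=0.1, size=(100, 2))
y[:10] += 20.0  # outliers at the start of the dataset


def per_fold_mae(kf):
    """Return an array of shape (n_splits, n_targets) with the MAE per fold."""
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        prediction = model.predict(X[test_idx])
        # axis=0 averages over samples, leaving one MAE per target column
        scores.append(np.mean(np.abs(prediction - y[test_idx]), axis=0))
    return np.array(scores)


# Default KFold: consecutive blocks, so fold 1 holds all the outliers
unshuffled = per_fold_mae(KFold(n_splits=5))
# Shuffled KFold: rows are permuted before being cut into folds
shuffled = per_fold_mae(KFold(n_splits=5, shuffle=True, random_state=0))

print("unshuffled:\n", unshuffled.round(2))
print("shuffled:\n", shuffled.round(2))
```

In the unshuffled run the first fold should stand out, because the model is trained on clean rows and tested on the outliers. Shuffling does not remove the outliers, it only distributes them, so if they are data errors it may still be worth inspecting or cleaning them separately.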