Question

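I want to separate the features into X and y, then preprocess my train and test data after splitting them with k-fold cross-validation. After that, I fit the training data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people have told me that this is the more correct way to do it, and I stick to that principle for the sake of my model's performance.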

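This is my first time using KFold cross-validation. My model's score looked like overfitting, and I thought I could fix that with cross-validation. I am still confused about how to use it: I have read the documentation and some articles, but I did not really grasp what it means for my model. I tried it anyway, and my model still overfits. Whether I use a train/test split or cross-validation, the model score is still 0.999. I don't know exactly where my mistake is, since I am new to this method, but I think I may have done it wrong, so it does not fix the overfitting. Please tell me what is wrong with my code and how to fix it.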

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')

avo_sales.rename(columns = {'4046':'small PLU sold',
                            '4225':'large PLU sold',
                            '4770':'xlarge PLU sold'},
                 inplace= True)

avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)

# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)


kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()

imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.transform(X_test[:,8])

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
confidence = rfr.score(X_test, y_test)
print(confidence)

Answer 1

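The reason you are overfitting is that an unregularized tree-based model will keep adjusting to the data until all training examples are correctly classified. For example, see the following picture: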

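As you can see, this does not generalize well. If you do not specify arguments that regularize the trees, the model will fit the test data poorly, because it will essentially just learn the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For example: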

max_features
min_samples_leaf
max_depth
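To see what these parameters do, here is a minimal sketch that compares an unregularized forest with a regularized one. The synthetic data, the variable names, and the specific parameter values are assumptions chosen for illustration; none of them come from the avocado dataset in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy data (an assumption for illustration, not the avocado data)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # default settings: trees grow until the leaves are (almost) pure
    "unregularized": RandomForestRegressor(n_estimators=100, random_state=0),
    # arbitrary regularization values, picked only for this demo
    "regularized": RandomForestRegressor(
        n_estimators=100, min_samples_leaf=10, max_depth=4, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    scores[name] = (train_r2, test_r2)
    print(f"{name}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

A large gap between the train and test scores is the usual sign of overfitting; the regularized forest should show a noticeably smaller gap on data like this.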

With proper regularization, you can get a model that generalizes well to the test data. For example, look at a regularized model:

To regularize your model, instantiate RandomForestRegressor() like this:

rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)

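These argument values are arbitrary; it is up to you to find the ones that best fit your data. You can use domain-specific knowledge to choose them, or run a hyperparameter tuning search such as GridSearchCV or RandomizedSearchCV.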
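A hyperparameter search of this kind could look roughly like the sketch below. The synthetic data and the candidate values in param_grid are assumptions for illustration; a real grid should be adapted to your own data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hypothetical candidate values for the regularization parameters
param_grid = {
    "max_features": [0.5, 1.0],
    "min_samples_leaf": [1, 4, 16],
    "max_depth": [3, 6, None],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated R^2 of best setting:", round(search.best_score_, 3))
```

Here cross-validation is used to compare candidate settings; the improvement comes from choosing better parameters, not from the cross-validation itself.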

Apart from that, imputing the mean or the median may add a lot of noise to your data. I would advise against it unless you have no other choice.
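If imputation cannot be avoided, one way to keep it confined to the training part of each fold is to put the imputer and the model into a single scikit-learn Pipeline. The following is only a sketch: the synthetic data, the 10% missing rate, and the parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Blank out roughly 10% of the entries to mimic missing values
X[rng.rand(*X.shape) < 0.1] = np.nan

# The imputer is re-fitted on the training part of every fold,
# so no statistics from the held-out part leak into training.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=50, min_samples_leaf=4, random_state=0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("R^2 per fold:", scores.round(2))
print("mean R^2:", round(scores.mean(), 3))
```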

Answer 2

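While @NicolasGervais's answer explains why your specific model is overfitting, I think there is a conceptual misunderstanding about cross-validation in the original question; you seem to think that:

Cross-validation is a method that improves the performance of a machine learning model.

But this is not the case.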

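Cross-validation is a method for estimating how a given model performs on unseen data. By itself, it cannot improve accuracy. In other words, the individual scores can tell you whether your model is overfitting the training data, but simply applying cross-validation does not make your model better.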
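As a rough sketch of using cross-validation purely as a measuring tool, scikit-learn's cross_validate can report both the training and the held-out score for every fold. The synthetic data and the model settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic noisy data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    return_train_score=True,  # also score each fold's training part
)

print("train R^2 per fold:", results["train_score"].round(2))
print("test  R^2 per fold:", results["test_score"].round(2))
```

If the training scores sit well above the test scores in every fold, the model is overfitting. The procedure only reports that; the model itself is unchanged.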

Example: let's take a dataset with 10 points and fit a line through it:

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.randint(0,10,10)
Y = np.random.randint(0,10,10)

fig = plt.figure(figsize=(1,10))

def line(x, slope, intercept):     
    return slope * x + intercept

for i in range(5):

    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    # replace=False keeps the two test indices distinct within one split
    test_indices = np.random.choice(np.arange(10), 2, replace=False)
    train_indices = list(set(range(10))-set(test_indices))

    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]
    # training set has one feature and multiple entries
    # so, reshape(-1,1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1,1), Y_train.reshape(-1,1), X_test.reshape(-1,1), Y_test.reshape(-1,1)

    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)

    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]

    print(score_test)
    # show train and test sets
    plt.subplot(5,1,i+1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')

    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))
    plt.ylim(0,10)
    plt.xlim(0,10)

    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))

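You can see that the scores on the training and test sets differ widely. You can also see that the estimated parameters vary a lot as the training and test sets change.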

This does not make your linear model any better at all. But now you know exactly how bad it is :)

KFold Cross Validation does not fix overfitting
