Question

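I am using a random forest regressor to fit a 10-dimensional regression problem with about 300,000 samples. Although it is not necessary when working with random forests, I first scaled the data to the same range (using sklearn's preprocessing) and then ran a randomized search over the following parameter space: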

    n_estimators = [int(x) for x in linspace(start=100, stop=2000, num=11)]
    max_features = auto, sqrt
    max_depth = from 1 to 150 with step = 11
    min_samples_split = 2, 5, 10, 12
    min_samples_leaf = 1, 2, 4, 6
    bootstrap = True or False

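Then, after obtaining the best parameters, I ran a second, narrower search. Even though I used a ten-fold cross-validation scheme in the randomized search, I still ran into a severe overfitting problem! I also tried using the DBSCAN algorithm to check for outliers. After excluding some parts of the dataset, I got even worse results! Should I include other random forest parameters in the randomized search? Or should I apply more preprocessing techniques to the dataset before fitting?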

For convenience, here is the implementation I wrote:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

# Candidate values for each hyperparameter
n_estimators = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
# Note: 'auto' is only accepted by older scikit-learn versions
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]

# 10 random splits, each holding out 1% of the data for validation
cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=cv, verbose=2, random_state=42,
                               n_jobs=32)
rf_random.fit(x_train, y_train)

The best parameters returned by the randomized search: bootstrap: False. min_samples_leaf = 2. n_estimators = 1647. max_features: sqrt. min_samples_split = 3. max_depth: None.

The target ranges from 0 to 10000 [unit]. The resulting model has an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test set.
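For a sense of scale, the reported errors can be compared with each other and with the target range. This is only a small arithmetic sketch using the three numbers quoted above; the variable names are illustrative.

```python
# Numbers quoted in the question
train_rmse = 6.98       # RMSE on the training set [unit]
test_rmse = 67.54       # average RMSE on the test set [unit]
target_range = 10000.0  # targets span 0 to 10000 [unit]

# How many times larger the test error is than the training error
gap_ratio = test_rmse / train_rmse

# Each error expressed as a percentage of the target range
train_pct = 100.0 * train_rmse / target_range
test_pct = 100.0 * test_rmse / target_range

print(f"test/train RMSE ratio: {gap_ratio:.2f}")
print(f"train RMSE: {train_pct:.4f}% of range")
print(f"test RMSE:  {test_pct:.4f}% of range")
```

The test error is roughly ten times the training error, which is the gap described as overfitting, although both are below 1% of the target range.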

Answer 1

This line:

max_depth = from 1 to 150 with step = 11

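For a problem with 10 features, the optimal depth is likely to be below 10. You are overfitting because of this. Consider searching max_depth from 1 to 15 with a step of 1.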
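A minimal sketch of such a restricted search is shown here. The synthetic data, the small forest size, and the 3-fold `KFold` are assumptions chosen only so the example runs quickly; they are not the asker's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the real data (assumption): 10 features, noisy target
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 10))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

param_distributions = {
    "max_depth": list(range(1, 16)),      # 1 to 15 in steps of 1
    "min_samples_split": [2, 5, 10, 12],
    "min_samples_leaf": [1, 2, 4, 6],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                             # kept small so the sketch is fast
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print(search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```

On the real data, `n_estimators`, `n_iter`, and the number of folds would be set much higher; only the shape of the `max_depth` grid is the point here.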

min_samples_split = 2, 5, 10, 12
min_samples_leaf = 1, 2, 4, 6

These should help reduce variance; however, the step of 11 for max_depth is killing all the effort you might be making.
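To see why the step size matters, the depth grid described in the question can be listed explicitly. This is a small illustration that reads "from 1 to 150 with step 11" as `range(1, 151, 11)`, which is an assumption about how the grid was built.

```python
# Depth grid as described in the question: 1 to 150 with a step of 11
coarse_depths = list(range(1, 151, 11))

# Suggested grid: 1 to 15 with a step of 1
fine_depths = list(range(1, 16))

# Candidates at or below a depth of 10 in each grid
coarse_shallow = [d for d in coarse_depths if d <= 10]
fine_shallow = [d for d in fine_depths if d <= 10]

print(coarse_depths)
print("shallow candidates (coarse):", coarse_shallow)
print("shallow candidates (fine):  ", fine_shallow)
```

With the coarse grid, the only candidate at or below 10 is a depth of 1, so the search can never land on a moderately shallow tree.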

High variance with a random forest learner
