Question

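This is my first time training a random forest model, and I have run into the following situation.

With the default parameters (as listed at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), accuracy on the training set is very high, at or above 0.95, which looks a lot like overfitting. On the test set, accuracy is only 0.66. My goal is to reduce overfitting in the hope of improving performance on the test set.
I tried a randomized search with 5-fold cross-validation over the following grid (following https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74):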

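import numpy as np

# Number of trees in the forest
n_estimators = [16, 32, 64, 128]
# Number of features considered at each split.
# Note: 'auto' behaved like 'sqrt' for classifiers and was removed in
# scikit-learn 1.3, so drop it from this list on recent versions.
max_features = ['auto', 'sqrt']
# Maximum tree depth: 10, 20, ..., 110, plus None (no depth limit)
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split an internal node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at a leaf node
min_samples_leaf = [1, 2, 4]
# Whether each tree is trained on a bootstrap sample
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}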

The best model had a cross-validated accuracy of 0.7.

I then used the best parameters selected in step 2 on the training and test sets, but training accuracy was again 0.95 and test accuracy was 0.66.
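For readers who want to reproduce this workflow, here is a minimal sketch of a randomized search with 5-fold cross-validation followed by a train/test comparison. It is not the asker's actual code: the data is synthetic (`make_classification`), `n_iter=10` and the random seeds are arbitrary, and `max_features` uses `'sqrt'` and `'log2'` as an assumption because `'auto'` is not accepted by recent scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: the real dataset is not shown).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Same grid as in the question, except for max_features (see lead-in).
param_distributions = {
    "n_estimators": [16, 32, 64, 128],
    "max_features": ["sqrt", "log2"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Sample 10 parameter combinations and score each with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# With the default refit=True, the best model is refit on all training data.
cv_acc = search.best_score_                  # mean CV accuracy of the best candidate
train_acc = search.score(X_train, y_train)   # accuracy on data the model has seen
test_acc = search.score(X_test, y_test)      # accuracy on held-out data

print("best params:", search.best_params_)
print(f"train={train_acc:.2f}  cv={cv_acc:.2f}  test={test_acc:.2f}")
```

In a setup like this, the cross-validated score is the number to compare against test accuracy. Training accuracy is measured on data the trees were fitted to, so it is usually much higher.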

Any suggestions? What do you think is happening here? How can I avoid overfitting (and possibly improve the model's performance)?

Answer 1

Someone ran into the same problem here and got some useful answers: https://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting

Your approach of using 5-fold cross-validation is already quite good; you could improve on it by using 10-fold cross-validation.
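As a rough illustration of the 10-fold suggestion, the sketch below scores a random forest with stratified 10-fold cross-validation. The synthetic data, `n_estimators=64`, and the seeds are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Ten stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=64, random_state=0),
    X, y, cv=cv, scoring="accuracy")

# One accuracy per fold; the spread shows how stable the estimate is.
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The same `cv` object can be passed to `RandomizedSearchCV(cv=...)` in place of `cv=5`.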

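Another question to ask yourself concerns the quality of the dataset. Are your classes balanced? If not, you could try to address the class imbalance, because imbalance usually biases the model toward the majority class.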
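One possible way to check for and react to class imbalance is sketched below. The 90/10 synthetic dataset is an assumption for illustration, and `class_weight="balanced"` is only one of several options (resampling is another); it is not necessarily what the answer had in mind.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced data (assumption: about 90% class 0).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: inspect the share of each class.
classes, counts = np.unique(y, return_counts=True)
class_share = dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))
print("class share:", class_share)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Step 2: reweight samples inversely to class frequency during training.
clf = RandomForestClassifier(n_estimators=64, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# Balanced accuracy averages recall over classes, so always predicting
# the majority class scores only 0.5 on a two-class problem.
bal_acc = balanced_accuracy_score(y_test, clf.predict(X_test))
print(f"balanced accuracy = {bal_acc:.3f}")
```

If the classes turn out to be imbalanced, plain accuracy can look acceptable even when the minority class is rarely predicted, which is why the sketch reports balanced accuracy.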

The dataset may also not be large enough; adding more data might improve performance as well.
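Whether more data is likely to help can be estimated with a learning curve. This is a sketch on synthetic data, and the forest settings and training sizes are arbitrary assumptions. If validation accuracy is still rising at the largest training size, collecting more data is more likely to pay off.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Fit on 20%, 40%, ..., 100% of each training fold, with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=32, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Rows correspond to training sizes, columns to CV folds.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.2f}  validation={va:.2f}")
```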

I hope this helps.
