Question

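I am running a random forest model with extensive cross-validation, and then comparing grid.best_score_ (which I believe is the mean_test_score from GridSearch) against the score on my actual hold-out set. There is a discrepancy between the two that did not appear before I changed the parameters.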

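I am not sure what lets me conclude that there is little to no overfitting: does my GridSearch mean_train_score need to be similar to my mean_test_score, or does my mean_test_score need to be similar to the score on the hold-out set?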
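To make the three quantities concrete, here is a minimal sketch that reads all of them from one fitted search. It is not the setup from this question: the synthetic data from make_regression, the small grid, and the built-in "neg_root_mean_squared_error" scorer are assumptions chosen so the example runs quickly. Note that mean_train_score is only recorded when return_train_score=True is passed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data (assumption, not the data from the question).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"min_samples_leaf": [1, 5]},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,  # needed for mean_train_score in cv_results_
)
grid.fit(X_train, y_train)

i = grid.best_index_  # row of cv_results_ that belongs to best_params_
train_rmse = -grid.cv_results_["mean_train_score"][i]  # fit on k-1 folds, scored on the same folds
cv_rmse = -grid.cv_results_["mean_test_score"][i]      # scored on the held-out fold; equals -best_score_
test_rmse = -grid.score(X_test, y_test)                # refit on all of X_train, scored on the hold-out set

print("train RMSE (CV folds):", round(train_rmse, 4))
print("CV RMSE:", round(cv_rmse, 4))
print("hold-out RMSE:", round(test_rmse, 4))
```

In this sketch the train-versus-CV difference measures how much better the model does on data it has seen, while the CV-versus-hold-out difference compares two estimates of performance on unseen data.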

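If this is indeed a problem, I would appreciate help identifying which random forest parameters I can tune further to make my model generalize better. Further feature selection/engineering is not possible. I need the features I have; some of them just have unpredictable outliers, which need to be kept.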

Edit: If I increase min_samples_leaf, my RMSE increases slightly, but the CV and test scores become very similar. That is the right way to prevent overfitting, correct?
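One way to look at the effect of min_samples_leaf in isolation is scikit-learn's validation_curve, which returns per-fold train and validation scores for each value of a single parameter. The sketch below is an illustration under assumptions (synthetic data, a small forest, and leaf sizes 1, 5, and 20 picked arbitrarily), not a recommendation for specific values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, validation_curve

# Synthetic stand-in data (assumption, not the data from the question).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

leaf_sizes = [1, 5, 20]  # illustrative values only
train_scores, cv_scores = validation_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X,
    y,
    param_name="min_samples_leaf",
    param_range=leaf_sizes,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)

# Both arrays have shape (n_param_values, n_folds); average over folds
# and flip the sign to get positive RMSE values.
train_rmse = -train_scores.mean(axis=1)
cv_rmse = -cv_scores.mean(axis=1)
gap = cv_rmse - train_rmse

for leaf, tr, cv_, g in zip(leaf_sizes, train_rmse, cv_rmse, gap):
    print(f"min_samples_leaf={leaf}: train={tr:.3f}, CV={cv_:.3f}, gap={g:.3f}")
```

Larger leaves constrain each tree, so the training error rises; whether the CV error rises or falls alongside it depends on the data.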

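from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# train_test() and rmse() are helper functions defined elsewhere (not shown).
def rf(df, score):
    # Split into a training set for the grid search and a hold-out test set.
    X_train, X_test, y_train, y_test = train_test(df)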

    params = {'n_estimators': [400, 700, 1000],
              # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0).
              'max_features': ['sqrt', 1.0],
              'min_samples_split': [2, 3],
              'min_samples_leaf': [1, 2, 3],
              'max_depth': [50, 100, None],
              'bootstrap': [True, False]
}

    scorers = {'RMSE': make_scorer(rmse, greater_is_better=False),
               'MAE': make_scorer(mean_absolute_error, greater_is_better=False),
               'R2': make_scorer(r2_score)}

    cv = RepeatedKFold(n_splits=10, n_repeats=7)


    # Use a fixed integer seed so the forest is reproducible.
    grid = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                        param_grid=params,
                        verbose=1,
                        cv=cv,
                        n_jobs=-1,
                        scoring=scorers,
                        refit=score)

    grid = grid.fit(X_train, y_train)    

    print('Parameters used:', grid.best_params_)

    # Scorers built with greater_is_better=False return negated values,
    # so flip the sign before printing RMSE and MAE.
    if score == 'RMSE':
        print('RMSE score on CV:', round(-1 * grid.best_score_, 4))
        print('RMSE score on test:', round(-1 * grid.score(X_test, y_test), 4))

    elif score == 'R2':
        print('R Squared score on CV:', round(grid.best_score_, 4))
        print('R Squared score on test:', round(grid.score(X_test, y_test), 4))

    elif score == 'MAE':
        print('MAE score on CV:', round(-1 * grid.best_score_, 4))
        print('MAE score on test:', round(-1 * grid.score(X_test, y_test), 4))

Parameters used: {'bootstrap': False, 'max_depth': 100, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}

RMSE score on CV: 8.489
RMSE score on test: 5.7952

I would like to narrow the gap between the two.
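Because best_score_ is an average over many folds while the hold-out score is a single number, it can help to look at how much the per-fold scores vary before judging the gap. The following is a minimal sketch under assumptions (synthetic data, a small forest, and 5 splits with 3 repeats instead of the 10 by 7 used above), using cross_val_score to expose the individual fold scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption, not the data from the question).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# One RMSE per fold: 5 splits x 3 repeats = 15 values.
fold_rmse = -cross_val_score(
    model, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
)

# Single hold-out RMSE from a model fit on the full training set.
model.fit(X_train, y_train)
test_rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))

print(f"CV RMSE: mean={fold_rmse.mean():.3f}, std={fold_rmse.std():.3f}, "
      f"min={fold_rmse.min():.3f}, max={fold_rmse.max():.3f}")
print(f"hold-out RMSE: {test_rmse:.3f}")
```

If the hold-out RMSE falls inside the range of the per-fold values, the gap may simply reflect which rows landed in the hold-out split; this is a heuristic check, not a formal test.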

Is the difference between my cross-validation score and my test score a problem?

0 answers: