Question

I am using GridSearchCV to identify the best set of parameters for a random forest classifier.

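from sklearn import grid_search  # legacy module, removed in scikit-learn 0.20; newer releases use sklearn.model_selection
from sklearn.ensemble import RandomForestClassifier

PARAMS = {
    'max_depth': [8, None],
    'n_estimators': [500, 1000]
}
rf = RandomForestClassifier()
clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS, scoring='roc_auc', cv=5, n_jobs=4)
clf.fit(data, labels)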

Here data and labels are the full dataset and the corresponding labels, respectively.

I then compared the performance returned by GridSearchCV (from clf.grid_scores_) with a "manual" AUC estimate:

import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

aucs = []
for fold in range(0, n_folds):
    # read_data, train_file_fold and test_file_fold (not shown) load the pre-split files for this fold
    train_data, train_labels = read_data(train_file_fold)
    test_data, test_labels = read_data(test_file_fold)
    clf = RandomForestClassifier(n_estimators=1000, max_depth=8)
    clf = clf.fit(train_data, train_labels)
    predicted_probs = clf.predict_proba(test_data)
    # keep only the probability of the positive class (column index 1)
    probabilities = [value[1] for value in predicted_probs]
    fpr, tpr, thresholds = metrics.roc_curve(test_labels, probabilities, pos_label=1)
    fold_auc = metrics.auc(fpr, tpr)
    aucs.append(fold_auc)

performance = np.mean(aucs)
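
As a side check on the manual computation above, here is a minimal sketch comparing the two usual ways of getting an AUC in scikit-learn: integrating the curve from metrics.roc_curve with metrics.auc, and calling metrics.roc_auc_score directly. As far as I understand, the 'roc_auc' scoring string relies on the latter. The labels and scores are toy values made up for illustration, not data from the question.

```python
import numpy as np
from sklearn import metrics

# Toy labels and scores, invented purely for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: build the ROC curve, then integrate it (as in the manual loop)
fpr, tpr, _ = metrics.roc_curve(y_true, y_score, pos_label=1)
auc_from_curve = metrics.auc(fpr, tpr)

# Route 2: compute the AUC directly from labels and scores
auc_direct = metrics.roc_auc_score(y_true, y_score)

print(auc_from_curve, auc_direct)
```

If both routes agree on the same inputs, as they should for binary labels, the AUC formula itself is an unlikely source of a gap between two evaluations.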

I manually pre-split the data into training and test sets (the same 5-fold CV approach).

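With the same RandomForest parameters, GridSearchCV always returns higher AUC values than the manually computed ones (e.g. 0.62 vs. 0.70). I know that different training and test splits can give different performance, but this kept happening across 100 repetitions of GridSearchCV. Interestingly, if I use accuracy instead of roc_auc as the scoring metric, the difference in performance is minimal and can plausibly be attributed to my using different training and test sets. Is this happening because GridSearchCV estimates the AUC differently from metrics.roc_curve?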
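
One way to narrow this down is to remove the split as a variable: give GridSearchCV and the manual loop the same fold object and the same random_state, then compare per-fold AUCs. The sketch below assumes a recent scikit-learn (sklearn.model_selection and cv_results_ rather than grid_search and grid_scores_) and uses a synthetic dataset from make_classification with a small forest; none of these values come from the question.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real data (assumption, for illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One fixed splitter shared by both evaluations
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)

# Evaluation 1: GridSearchCV with a single-point grid
search = GridSearchCV(rf, param_grid={'max_depth': [8]}, scoring='roc_auc', cv=cv)
search.fit(X, y)
grid_fold_aucs = [search.cv_results_['split%d_test_score' % i][0] for i in range(5)]

# Evaluation 2: manual loop over the very same folds
manual_fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(rf).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]  # positive-class probability
    manual_fold_aucs.append(roc_auc_score(y[test_idx], proba))

print(np.mean(grid_fold_aucs), np.mean(manual_fold_aucs))
```

If the per-fold values match in a setup like this, the discrepancy is more likely to come from how the pre-split files differ from the folds GridSearchCV builds internally (for example stratification or ordering of the samples) than from the AUC calculation.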

Scikit-learn GridSearchCV AUC performance

0 answers: