How to get the confusion matrix of GridSearchCV's best_estimator

Time: 2018-04-11 19:20:22

Tags: python machine-learning grid-search hyperparameters

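I am using GridSearchCV with a RandomForestClassifier for hyperparameter tuning. For evaluation, I would like the confusion matrix of the best_estimator, which, as far as I can tell, GridSearchCV does not store.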

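from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# `scoring` is defined elsewhere; refit='Accuracy' implies it is a dict with an 'Accuracy' key.
gs = GridSearchCV(
    RandomForestClassifier(n_estimators=1000, random_state=42),
    param_grid={'max_depth': range(5, 25, 4),
                'min_samples_leaf': range(5, 40, 5),
                'criterion': ['entropy', 'gini']},
    scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_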
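For reference, here is a minimal sketch of how the per-fold scores of the winning parameter set can be read back from `cv_results_` when multi-metric scoring is used. The synthetic data, the small grid, and the contents of the `scoring` dict are assumptions made only so the example runs on its own. The question does not show how `scoring` is defined, only that it must contain an `'Accuracy'` key.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_Distances / Y (assumption, not the asker's data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# Hypothetical multi-metric dict; the key 'Accuracy' must match refit='Accuracy'.
scoring = {"Accuracy": "accuracy", "F1": "f1_macro"}

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [5, 10]},
    scoring=scoring, cv=3, refit="Accuracy",
)
gs.fit(X, y)

results = gs.cv_results_
i = gs.best_index_  # row of cv_results_ that belongs to best_params_

# With a scoring dict, the keys are named split<k>_test_<metric name>.
split_acc = [results[f"split{k}_test_Accuracy"][i] for k in range(3)]
print("best_index:", i)
print("best_params:", gs.best_params_)
print("per-fold accuracy:", split_acc)
print("mean_test_Accuracy:", results["mean_test_Accuracy"][i])
print("best_score_:", gs.best_score_)
```

`cv_results_` holds scores only, not predictions, so a confusion matrix cannot be derived from it directly.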

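I initialize the grid search with the parameters above to obtain the best parameters. I then use those best parameters to re-create the grid search's best_estimator and run a stratified cross-validation myself. I assume I am training and validating this estimator on the same train/test splits as GridSearchCV, because I use the same parameters and the same cross-validation setup (stratified, 3 folds).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
# `score` is unpacked into four values below, so it is presumably this function.
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import StratifiedKFold

rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18,
                            criterion='entropy', random_state=42)
metrics = {'accuracy': [], 'precision': [], 'recall': [], 'fscore': [], 'support': []}
counter = 0
print('################################################### RandomForest ###################################################')
# random_state has no effect when shuffle=False (newer scikit-learn raises an error), so it is omitted.
skf = StratifiedKFold(n_splits=3, shuffle=False)
for train_index, test_index in skf.split(X_Distances, Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)
    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    # `methods` is my own helper module for saving the plot.
    methods.saveConfusionMatrix(matrix, 'confusion_matrix_randomforest_distances_' + str(counter) + '.png')
    counter += 1
meanAcc = round(np.mean(np.asarray(metrics['accuracy'])), 2) * 100
print('meanAcc: ', meanAcc)

I am somewhat concerned about overfitting. Is there a simpler way to get the confusion matrix of the best_estimator? If not, is my approach correct?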
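One possible shortcut, sketched here under assumptions, is to pass an unfitted copy of `best_estimator_` to `cross_val_predict` with the same splitter that was given to GridSearchCV, and then build a single confusion matrix from the out-of-fold predictions. The synthetic data and the small grid are placeholders so the example is self-contained. This is not presented as the only way to do it, and it produces one pooled matrix rather than one matrix per fold.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_predict)

# Synthetic stand-in for X_Distances / Y (assumption, not the asker's data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# One splitter object reused for both steps, so the folds are identical.
cv = StratifiedKFold(n_splits=3)

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [5, 10]},
    scoring="accuracy", cv=cv,
)
gs.fit(X, y)

# best_estimator_ was refit on ALL data; clone() gives an unfitted copy
# with the same hyperparameters, which is then refit inside each fold.
best = clone(gs.best_estimator_)
y_pred = cross_val_predict(best, X, y, cv=cv)

# Every sample is predicted exactly once, by a model that never saw it.
cm = confusion_matrix(y, y_pred)
print(gs.best_params_)
print(cm)
```

Because each prediction comes from a model trained without that sample, the matrix reflects held-out performance. The parameters were still chosen on the same folds, so the numbers are likely to be somewhat optimistic compared with a separate test set.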

Edit: I just tested the following:

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid={'max_depth': range(5, 25, 4),
                'min_samples_leaf': range(5, 40, 5),
                'criterion': ['entropy', 'gini']},
    scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)

This yields best_score = 0.5362903225806451 at best_index = 28. When I check the accuracy of the 3 folds at index 28, I get:

  1. split0: 0.5185929648241207
  2. split1: 0.526686807653575
  3. split2: 0.5637651821862348

This gives a mean test accuracy of 0.5362903225806451. best_params: {'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}

I then ran code that uses the best_params above and a stratified 3-fold split (as GridSearchCV does).

The metrics dictionary yields exactly the same accuracies (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348).

However, the mean I compute from them is slightly different: 0.5363483182213101
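The arithmetic behind the two means can be checked directly. The sketch below computes the plain mean of the three fold accuracies and, for comparison, a mean weighted by test-fold size. The fold sizes (995, 993, 988) are an assumption: they are not stated in the question and were inferred because each reported accuracy is an exact fraction over one of them. Older scikit-learn releases had an `iid` option on GridSearchCV that weighted fold scores by test-set size. Whether that applies here depends on the version in use, which the question does not state.

```python
# Per-fold accuracies reported in the question.
fold_acc = [0.5185929648241207, 0.526686807653575, 0.5637651821862348]

# ASSUMPTION: test-fold sizes inferred from the fractions above
# (516/995, 523/993, 557/988); they are not given in the question.
fold_sizes = [995, 993, 988]

# Unweighted mean: every fold counts equally.
plain_mean = sum(fold_acc) / len(fold_acc)

# Size-weighted mean: equivalent to total correct / total samples.
weighted_mean = (sum(a * n for a, n in zip(fold_acc, fold_sizes))
                 / sum(fold_sizes))

print("plain mean:   ", plain_mean)     # about 0.53635
print("weighted mean:", weighted_mean)  # about 0.53629
```

Under that assumption, the plain mean reproduces the manually computed value and the weighted mean reproduces best_score, so the two numbers would differ only in how the folds are averaged.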

0 answers