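I am using GridSearchCV to tune the parameters of a RandomForestClassifier. For evaluation purposes I would like the confusion matrix of best_estimator, which, as far as I know, GridSearchCV does not store.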
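from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# `scoring`, `X_Distances` and `Y` are defined elsewhere in my script.
param_grid = {'max_depth': range(5, 25, 4), 'min_samples_leaf': range(5, 40, 5), 'criterion': ['entropy', 'gini']}
gs = GridSearchCV(RandomForestClassifier(n_estimators=1000, random_state=42), param_grid=param_grid, scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_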
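For reference, here is a minimal sketch of how the per-fold scores of the winning parameter set can be read back out of cv_results_. It is not my actual setup: the synthetic data, the small parameter grid, and the names `X_demo`, `y_demo`, `scoring_demo` and `gs_demo` are assumptions made only so the snippet runs on its own, and the contents of the `scoring` dict are a guess (my real dict is not shown here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: replaces X_Distances / Y).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=2, random_state=0
)

# Hypothetical multi-metric scoring dict; refit='Accuracy' must name one of its keys.
scoring_demo = {'Accuracy': 'accuracy', 'F1': 'f1_macro'}

gs_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={'max_depth': [3, 5]},
    scoring=scoring_demo,
    cv=3,
    refit='Accuracy',
)
gs_demo.fit(X_demo, y_demo)

# Per-fold test accuracy of the best candidate, plus the mean reported by the search.
best = gs_demo.best_index_
split_acc = [gs_demo.cv_results_['split%d_test_Accuracy' % i][best] for i in range(3)]
mean_acc = gs_demo.cv_results_['mean_test_Accuracy'][best]

print(gs_demo.best_params_)
print(split_acc, mean_acc, gs_demo.best_score_)
```

With multi-metric scoring the keys in cv_results_ carry the scorer name as a suffix, so the key to look up depends on how the scoring dict is named.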
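I initialize the grid search with the given parameters to obtain the best parameters. Finally, I use the best parameters to replicate the grid search's best_estimator and run a stratified cross-validation. I assume that I am training and validating this estimator on the same train/test splits as GridSearchCV, since I use the same parameters and the same cross-validation options (stratified, 3-fold).
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Assumed alias: `score` returns (precision, recall, fscore, support).
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import StratifiedKFold
rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18, criterion='entropy', random_state=42)
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
print('################################################### RandomForest ###################################################')
# random_state has no effect when shuffle=False (recent scikit-learn versions reject the combination), so it is omitted.
skf = StratifiedKFold(n_splits=3, shuffle=False)
for counter, (train_index, test_index) in enumerate(skf.split(X_Distances, Y)):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    # Per-class scores for this fold, rounded to two decimals.
    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)
    print(classification_report(y_test, y_pred))
    # One confusion matrix per fold, saved with my own helper.
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
meanAcc = round(np.mean(np.asarray(metrics['accuracy'])), 2) * 100
print('meanAcc: ', meanAcc)
I have some concerns about overfitting. Is there an easier way to get the confusion matrix of best_estimator? If not, is my approach correct?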
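To make the question concrete, this is roughly the kind of shortcut I have in mind: a sketch that passes one StratifiedKFold object to both GridSearchCV and cross_val_predict, so that the folds are identical by construction, and then builds a single confusion matrix from the out-of-fold predictions. The synthetic data and the reduced grid are assumptions made so the snippet runs on its own, and I am not sure this is the recommended way to do it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption: replaces X_Distances / Y).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# One splitter shared by both steps; without shuffling the folds are deterministic.
skf = StratifiedKFold(n_splits=3)

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid={'max_depth': [5, 9], 'min_samples_leaf': [5, 10]},
    scoring='accuracy',
    cv=skf,
)
gs.fit(X, y)

# cross_val_predict refits a clone of the estimator on each training split and
# returns the prediction each sample received while it was in the test split.
y_pred = cross_val_predict(gs.best_estimator_, X, y, cv=skf)

matrix = confusion_matrix(y, y_pred)
pooled_acc = accuracy_score(y, y_pred)
print(matrix)
print(pooled_acc)
```

Note that this gives one matrix pooled over all three folds rather than one matrix per fold, and the pooled accuracy is a sample-weighted average of the fold accuracies, so it need not equal their plain mean.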
This grid search:
gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42), param_grid={'max_depth': range(5, 25, 4), 'min_samples_leaf': range(5, 40, 5),'criterion': ['entropy', 'gini']}, scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
yields best_score = 0.5362903225806451 at best_index = 28. When I check the accuracies of the three folds at index 28, they give a mean test accuracy of 0.5362903225806451. best_params:
{'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}
Now I run my code with the best_params mentioned above and a stratified 3-fold split (as in GridSearchCV).
The metrics dictionary yields exactly the same accuracies (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348).
However, the mean I compute is slightly off: 0.5363483182213101
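The two means can be compared with a bit of arithmetic. The sketch below rests on two assumptions that are not confirmed anywhere above: test-fold sizes of 995, 993 and 988 samples (inferred by me from the accuracy fractions, not reported by the search), and a scikit-learn version in which GridSearchCV weights each fold by its test-set size (the old `iid=True` default, which later releases changed and then removed).

```python
# Per-fold accuracies quoted in the question.
fold_acc = [0.5185929648241207, 0.526686807653575, 0.5637651821862348]

# ASSUMPTION: test-fold sizes inferred from the fractions, not stated in the question.
fold_sizes = [995, 993, 988]

# Plain mean: every fold counts equally.
plain_mean = sum(fold_acc) / len(fold_acc)

# Size-weighted mean: every sample counts equally.
weighted_mean = sum(a * n for a, n in zip(fold_acc, fold_sizes)) / sum(fold_sizes)

print(plain_mean)
print(weighted_mean)
```

Under those assumptions the plain mean matches my 0.5363483182213101 and the size-weighted mean matches the reported best_score of 0.5362903225806451, which would account for the small gap.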