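I have a binary classification problem, for which I have chosen 3 algorithms: Logistic Regression, SVM and AdaBoost. I used grid search with k-fold cross-validation on each of them to find the best set of hyperparameters. Now I need to pick the best model based on accuracy, precision and recall, but I can't find a suitable way to retrieve that information. My code is as follows: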
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# TODO: Initialize the classifier
clfr_A = LogisticRegression(random_state=128)
clfr_B = SVC(random_state=128)
clfr_C = AdaBoostClassifier(random_state=128)

# Hyperparameter grids, one per classifier
lr_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
svc_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
adb_param_grid = {'n_estimators': [50, 100, 150, 200, 250, 500],
                  'learning_rate': [.5, .75, 1.0, 1.25, 1.5, 1.75, 2.0]}

# TODO: Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)

# TODO: Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
clfrs = [clfr_A, clfr_B, clfr_C]
params = [lr_param_grid, svc_param_grid, adb_param_grid]
for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scorer, refit=True)
    grid_fit = grid_obj.fit(features_raw, target_raw)
    print(grid_fit.best_estimator_)
    print(grid_fit.cv_results_)
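The problem is that cv_results_ gives a lot of information, but I can't find anything relevant in it other than mean_test_score. I also don't see any accuracy, precision or recall metrics there.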
I can think of one way to do it: change the for loop to something like this:
score_params = ["accuracy", "precision", "recall"]
for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scorer, refit=True)
    grid_fit = grid_obj.fit(features_raw, target_raw)
    best_clf = grid_fit.best_estimator_
    # Re-run cross-validation on the best estimator once per extra metric
    for score in score_params:
        print(score, " : ",
              cross_val_score(best_clf, features_raw, target_raw, scoring=score, cv=3).mean())
But is there a better way? It seems I am repeating the same work several times for each model. Any help is appreciated.
Answer 0 (score: 3)
GridSearchCV is doing exactly what you told it to do. You passed f_beta as the scorer, so mean_test_score holds the f_beta result for each parameter combination.
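To make this concrete, here is a minimal sketch of what a single-scorer search stores. It assumes a small synthetic dataset from make_classification in place of features_raw / target_raw, and a reduced grid, neither of which comes from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the question uses features_raw / target_raw).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A single scorer: every "*_test_score" entry refers to this F-beta score only.
single = GridSearchCV(
    LogisticRegression(random_state=128, max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=3,
    scoring=make_scorer(fbeta_score, beta=0.5),
)
single.fit(X, y)

# List the score-related keys; there is one mean_test_score value per grid point.
score_keys = sorted(k for k in single.cv_results_ if k.endswith("_test_score"))
print(score_keys)
print(single.cv_results_["mean_test_score"])
```

Nothing named after accuracy, precision or recall appears among the keys, because none of those metrics was requested.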
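If you want access to other metrics, you need to tell GridSearchCV explicitly to compute them.
GridSearchCV supports multi-metric scoring, so you can pass several scorers to it. As per the documentation: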
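scoring : string, callable, list/tuple, dict or None, default: None
...
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.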
Look at the example here, and change your scoring parameter to:
scoring = {'Accuracy': 'accuracy',
           'FBeta': make_scorer(fbeta_score, beta=0.5),
           # ... Add others here as you want.
           }
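As a rough illustration of what such a dict produces, the sketch below adds 'Precision' and 'Recall' entries and lists the resulting cv_results_ keys. The synthetic data, the reduced grid and the two extra entries are assumptions for the example. It passes refit=False only to inspect the results without yet choosing which metric selects the final model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Dict keys are free-form names; values are scorer strings or scorer callables.
scoring = {
    "Accuracy": "accuracy",
    "Precision": "precision",
    "Recall": "recall",
    "FBeta": make_scorer(fbeta_score, beta=0.5),
}

grid = GridSearchCV(
    LogisticRegression(random_state=128, max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=3,
    scoring=scoring,
    refit=False,  # no final refit, so no metric has to be singled out here
)
grid.fit(X, y)

# Each dict key gets its own family of columns: mean_test_<name>, std_test_<name>, ...
mean_keys = sorted(k for k in grid.cv_results_ if k.startswith("mean_test_"))
print(mean_keys)
```

Each name in the dict becomes a suffix in cv_results_, so all metrics come out of a single search instead of separate cross_val_score runs.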
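But once you do this, you also need to change the refit parameter. The different metrics will give different scores for each parameter combination, so you have to decide which one is used to pick the parameters when the estimator is refitted. So change refit from True to one of the keys defined in the scoring dict: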
for clfr, param in zip(clfrs, params):
    # Pass the multi-metric dict and name the metric used to choose the refitted model
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scoring, refit='FBeta')
    ...
    ...
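Putting the pieces together, here is one possible sketch of the whole comparison. It assumes synthetic data from make_classification, smaller grids than in the question to keep the run short, and a hypothetical summary dict for collecting the numbers; adapt these to the real features_raw / target_raw and grids.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with features_raw / target_raw).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scoring = {
    "Accuracy": "accuracy",
    "Precision": "precision",
    "Recall": "recall",
    "FBeta": make_scorer(fbeta_score, beta=0.5),
}

# Reduced grids for illustration only.
candidates = [
    (LogisticRegression(random_state=128, max_iter=1000), {"C": [0.01, 1, 100]}),
    (SVC(random_state=128), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}),
    (AdaBoostClassifier(random_state=128),
     {"n_estimators": [50, 100], "learning_rate": [0.5, 1.0]}),
]

summary = {}  # hypothetical container: model name -> metric name -> mean CV score
for clf, param_grid in candidates:
    search = GridSearchCV(clf, param_grid, cv=3, scoring=scoring, refit="FBeta")
    search.fit(X, y)
    best = search.best_index_  # row of cv_results_ with the best mean FBeta
    summary[type(clf).__name__] = {
        name: float(search.cv_results_["mean_test_" + name][best])
        for name in scoring
    }

for model, scores in summary.items():
    print(model, {name: round(value, 3) for name, value in scores.items()})
```

Because refit='FBeta', best_index_ points at the parameter combination with the highest mean F-beta score, and the other metrics are read from that same row, so no additional cross-validation pass is needed.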