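I am using RandomizedSearchCV to tune the hyperparameters of a random forest. Once I have a good parameter set, I evaluate the model with cross_validation.cross_val_score.

I noticed that the score reported by RandomizedSearchCV differs slightly from the score reported by cross_validation.cross_val_score. The cross_val_score result is always better than the RandomizedSearchCV result.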

import numpy as np
from operator import itemgetter
from scipy.stats import randint as sp_randint
from sklearn import cross_validation
from sklearn.grid_search import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

def report(grid_scores, n_top=3, reverse=True):
    # sort by mean validation score (index 1 of each grid_scores_ entry)
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=reverse)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            score.mean_validation_score,
            np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")

# get x and y
digits = load_digits()
X, y = digits.data, digits.target

# find good parameter set
param_dist = {'n_estimators': sp_randint(500, 2000)}
n_iter = 100
clf = RandomForestClassifier()
random_search_clf = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=n_iter, scoring="f1", n_jobs=2)
random_search_clf.fit(X, y)
report(random_search_clf.grid_scores_, 1, reverse=True)  # print the score of top estimator

# evaluate with cross_validation
# (kappa_scorer and fit_params are defined elsewhere and not shown here)
_param = random_search_clf.best_estimator_.get_params()
clf = RandomForestClassifier(**_param)
scores = cross_validation.cross_val_score(clf, X, y, cv=3, scoring=kappa_scorer, n_jobs=2, fit_params=fit_params)
print(scores)  # score by cross_val_score

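My question is why this happens and which score I should trust. In the code above, why does the score printed by the report method differ from the scores printed by the final print statement?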
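
For anyone trying to reproduce this, here is a minimal sketch of a like-for-like comparison. It is not the code from the question: it assumes the newer sklearn.model_selection API (best_score_ instead of grid_scores_), and the helper name compare_scores, the "f1_macro" scorer, the small n_estimators range, and the fixed random_state values are my own choices made only to keep the check quick and repeatable. The idea is to give both RandomizedSearchCV and cross_val_score the same scorer, the same CV splitter, and the same seed, so that any remaining gap cannot come from those sources.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)


def compare_scores(n_iter=3, random_state=0):
    """Hypothetical helper: return (search best score, mean cross_val_score)."""
    X, y = load_digits(return_X_y=True)

    # One splitter and one scorer, shared by both evaluations (assumed choices).
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
    scoring = "f1_macro"

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_distributions={"n_estimators": sp_randint(20, 60)},
        n_iter=n_iter,
        scoring=scoring,
        cv=cv,
        random_state=random_state,
    )
    search.fit(X, y)

    # Re-evaluate the winning parameters with the same folds, scorer and seed.
    clf = RandomForestClassifier(random_state=random_state, **search.best_params_)
    scores = cross_val_score(clf, X, y, cv=cv, scoring=scoring)
    return search.best_score_, scores.mean()


if __name__ == "__main__":
    best, mean_cv = compare_scores()
    print("RandomizedSearchCV best_score_: {0:.4f}".format(best))
    print("cross_val_score mean:           {0:.4f}".format(mean_cv))
```

Under these assumptions the two numbers should agree. If they do, the difference in my original code would come from something that is not held constant there, for example the scorer ("f1" versus kappa_scorer), the fold assignment, or the unseeded forest.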