from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
import time
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600,
                    silent=True, nthread=1)
folds = 5
param_comb = 5
skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring=['f1_macro','precision_macro'], n_jobs=4, cv=skf.split(X_train,y_train), verbose=3, random_state=1001)
start_time = time.clock() # timing starts from this point for "start_time" variable
random_search.fit(X_train, y_train)
elapsed = (time.clock() - start_time) # timing ends here for "start_time" variable
My code is above. My y_train is a pandas Series with multiple classes, integers from 0 to 9.
y_train.head()
1041 8
1177 7
2966 0
1690 2
2115 1
Name: Industry, dtype: object
After running the setup code above, I get this error message:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.
I searched similar questions, tried cross_validate from sklearn.model_selection, and tried other metrics that are compatible with multiclass targets, but I still get the same error message.
Is there a way to run a grid search with stratified cross-validation on the performance metrics I want?
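For context on that first error: sklearn derives the target type from the labels, and the dtype: object shown by y_train.head() above is exactly what makes it report 'unknown'. A minimal sketch of checking and converting the labels (assuming they really are plain integers stored as objects; presumably this is what "fixing the dtype issue" in the update below refers to):

from sklearn.utils.multiclass import type_of_target

# object dtype makes sklearn classify the target as 'unknown'
print(type_of_target(y_train))      # 'unknown'

# cast the labels to integers so the target is recognised as 'multiclass'
y_train = y_train.astype(int)
print(type_of_target(y_train))      # 'multiclass'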
UPDATE: After fixing the dtype issue, I wanted to pass multiple metrics to scoring=. I tried it this way, based on the documentation (http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter):
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring=['f1_macro','precision_macro'], n_jobs=4, cv=skf.split(X_train,y_train), verbose=3, random_state=1001)
It then failed with the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-67-dd57cd97c89c> in <module>()
     36 # Here we go
     37 start_time = time.clock() # timing starts from this point for "start_time" variable
---> 38 random_search.fit(X_train, y_train)
     39 elapsed = (time.clock() - start) # timing ends here for "start_time" variable

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    609                     "available for that metric. If this is not "
    610                     "needed, refit should be set to False "
--> 611                     "explicitly. %r was passed." % self.refit)
    612                 else:
    613                     refit_metric = self.refit

ValueError: For multi-metric scoring, the parameter refit must be set to a scorer key to refit an estimator with the best parameter setting on the whole data and make the best_* attributes available for that metric. If this is not needed, refit should be set to False explicitly. True was passed.
How can I fix this?
Answer 0 (score: 1)
As written here in the user guide:
When specifying multiple metrics, the refit parameter must be set to the metric (string) for which the best_params_ will be found and used to build the best_estimator_ on the whole dataset. If the search should not be refit, set refit=False. Leaving refit to the default value None will result in an error when using multiple metrics.
Since you are using multiple metrics here:
random_search = RandomizedSearchCV(xgb, param_distributions=params,
n_iter=param_comb,
scoring=['f1_macro','precision_macro'],
n_jobs=4,
cv=skf.split(X_train,y_train),
verbose=3, random_state=1001)
RandomizedSearchCV does not know how to find the best parameters: it cannot pick a single best score from two different scoring strategies. So you need to specify which of your scorers it should use to find the best parameters.
To do that, set the refit parameter to one of the options you used in scoring, like this:
random_search = RandomizedSearchCV(xgb, param_distributions=params,
...
scoring=['f1_macro','precision_macro'],
...
refit = 'f1_macro')
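With refit='f1_macro', the best_* attributes refer to the f1_macro scorer, while both metrics stay available in cv_results_ under keys named after the scorer strings. A short usage sketch of this documented multi-metric behaviour (X_train and y_train are your data from above):

random_search.fit(X_train, y_train)

# best_params_ / best_score_ are computed for the refit metric ('f1_macro')
print(random_search.best_params_)
print(random_search.best_score_)

# per-metric cross-validation results, one set of columns per scorer
results = random_search.cv_results_
print(results['mean_test_f1_macro'])
print(results['mean_test_precision_macro'])

If you do not want any refitting at all, pass refit=False instead; then only cv_results_ is populated and the best_* attributes are not available.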