I have used GridSearch (specifically RandomizedSearchCV) to tune an XGBoost classifier for a classification problem:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# A parameter grid for XGBoost
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}

# Base model; the search varies the parameters listed in the grid above
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600,
                    objective='binary:logistic',
                    silent=True, nthread=1)

folds = 3
param_comb = 5  # number of parameter combinations sampled from the grid

skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)

# Randomized search over the grid, scored by cross-validated ROC AUC
random_search = RandomizedSearchCV(xgb, param_distributions=params,
                                   n_iter=param_comb, scoring='roc_auc',
                                   n_jobs=4,
                                   cv=skf.split(X_train_resampled, y_train_resampled),
                                   verbose=3, random_state=1001)

# Fit the search on the (already resampled) training data
random_search.fit(X_train_resampled, y_train_resampled)

print('\n Best hyperparameters:')
print(random_search.best_params_)
print('\n Best estimator:')
print(random_search.best_estimator_)
After that, I got:
Best hyperparameters: {'subsample': 0.6, 'min_child_weight': 1, 'max_depth': 5, 'gamma': 1.5, 'colsample_bytree': 0.8}
Best estimator: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=0.8, gamma=1.5, learning_rate=0.02, max_delta_step=0, max_depth=5, min_child_weight=1, missing=None, n_estimators=600, n_jobs=1, nthread=1, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=0.6)
Best ROC AUC score = 0.9719630276538562. Then I ran the classifier:
from sklearn.metrics import roc_auc_score

# Rebuild the classifier with the best hyperparameters found by the search
model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                      colsample_bytree=0.8, gamma=1.5, learning_rate=0.02,
                      max_delta_step=0, max_depth=5, min_child_weight=1, missing=None,
                      n_estimators=600, n_jobs=4, objective='binary:logistic',
                      random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                      seed=None, silent=True, subsample=0.6)
model.fit(X_train_resampled, y_train_resampled)

# Predict positive-class probabilities for the test data
predictions = model.predict_proba(X_test_scaled)[:, 1]

# Evaluate the predictions
print('ROC AUC score:', roc_auc_score(y_test, predictions))
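I have read the recent thread (What is the difference between cross_val_score with scoring='roc_auc' and roc_auc_score?), but the problem remains. I used predict_proba and got a ROC AUC score of 0.791423604769.
Why is there such a difference? Any suggestions? Before fitting the classifier I scale and resample the data, but with a fixed random state, the same as for the grid search.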
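One possible cause worth checking, offered as a sketch and not a diagnosis: the search is cross-validated on X_train_resampled, so if resampling happened before the folds were split, copies of the same minority rows can land in both the training and validation parts of a fold and inflate the cross-validated ROC AUC, while the untouched test set gives a lower score. The example below illustrates that effect under loud assumptions: the data is synthetic (make_classification), the resampling is plain random oversampling through sklearn.utils.resample (the resampling method used in the question is not stated), and RandomForestClassifier stands in for XGBClassifier so the snippet needs only scikit-learn and NumPy. The helpers oversample and cv_auc are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def oversample(X, y, seed):
    """Duplicate minority-class rows (label 1) until both classes are the same size."""
    minority = y == 1
    n_extra = int((~minority).sum() - minority.sum())
    X_extra, y_extra = resample(X[minority], y[minority], replace=True,
                                n_samples=n_extra, random_state=seed)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])


def cv_auc(X, y, resample_inside_fold):
    """Mean ROC AUC over 3 stratified folds; optionally oversample the training part only."""
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if resample_inside_fold:
            X_tr, y_tr = oversample(X_tr, y_tr, seed=0)
        # Stand-in model (assumption): the question uses XGBClassifier instead
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        proba = clf.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], proba))
    return float(np.mean(scores))


# Synthetic, imbalanced, noisy data (assumption: not the data from the question)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], flip_y=0.05, class_sep=0.5,
                           random_state=0)

# Variant 1: oversample first, then cross-validate on the resampled set
X_res, y_res = oversample(X, y, seed=0)
leaky_auc = cv_auc(X_res, y_res, resample_inside_fold=False)

# Variant 2: split first, oversample only the training part of each fold
honest_auc = cv_auc(X, y, resample_inside_fold=True)

print('CV ROC AUC, resampled before splitting:', leaky_auc)
print('CV ROC AUC, resampled inside each fold:', honest_auc)
```

If the first number comes out clearly higher than the second, the cross-validated score from the search may be optimistic for the same reason. A common remedy is to move scaling and resampling into a pipeline that is refit on the training part of each fold, so that validation folds keep the original class distribution, and then compare that score against the test-set ROC AUC.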