Update: I set cv to 5 and the same problem persists.
I want to tune XGBoost's hyperparameters. Because the classes are imbalanced, I use PR-AUC (average_precision) as the score for evaluating model performance. Also, to keep the number of parameter combinations small, each RandomizedSearchCV run optimizes only one or two parameters at a time.
The steps I took:
The first code block below optimizes subsample; the result shows the best subsample is 0.8.
Then I fixed subsample = 0.8 and optimized colsample_bytree and max_depth. However, the final performance is slightly lower than the first model's. (Note: although I use RandomizedSearchCV, I set param_comb so that the entire parameter space is actually searched.)
I expected the second optimization to increase PR-AUC. (The best hyperparameters from the first optimization are in fact contained in the second optimization's search space.) I have been considering the following possibilities:
1) Is cross-validation = 3 perhaps not large enough? This could cause variance in the results.
2) Is random_state set incorrectly?
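On possibility 1): the size of the fold-to-fold spread is measurable. Below is a minimal, self-contained sketch (using sklearn's GradientBoostingClassifier and synthetic imbalanced data as stand-ins for XGBClassifier and the real dataset) showing two relevant points: with n_iter equal to the number of grid combinations, RandomizedSearchCV over lists samples without replacement and therefore tries every candidate; and the std_test_score of the winning candidate gives a scale for how much of a PR-AUC difference is plausibly just CV noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Imbalanced synthetic data as a stand-in for the real training set
X, Y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

params = {'subsample': [0.6, 0.8, 1.0], 'max_depth': [3, 5]}
n_comb = len(params['subsample']) * len(params['max_depth'])  # 6 combinations

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=27),
    param_distributions=params,
    n_iter=n_comb,                 # n_iter == grid size -> exhaustive search
    scoring='average_precision',
    cv=skf,
    random_state=1001,
)
search.fit(X, Y)

# Every candidate in the grid was evaluated
assert len(search.cv_results_['params']) == n_comb

# Fold-to-fold spread of the best candidate's PR-AUC: score differences
# between "best" models smaller than this are plausibly just CV noise
best = search.best_index_
print(search.cv_results_['mean_test_score'][best],
      '+/-', search.cv_results_['std_test_score'][best])
```

If the gap between the two optimizations' test PR-AUC scores (here about 0.00007) is far below this std, the two best models are statistically indistinguishable under 3-fold CV.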
# import libraries
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import average_precision_score  # used for the scores printed below
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
1st optimization:
params = {
'min_child_weight': [1],
'gamma': [0],
'subsample': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.8],
'max_depth': [5],
'n_estimators': [600]
}
model1 = XGBClassifier(learning_rate=0.1, objective='binary:logistic', scale_pos_weight=1, seed=27,
silent=True, nthread=4)
param_comb = 7  # equals the number of subsample values, so the whole grid is tried
folds = 3  # cv = 3, as mentioned above
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
# X,Y Both refer to training datasets.
model1_rs = RandomizedSearchCV(model1, param_distributions=params, n_iter=param_comb, scoring='average_precision', n_jobs=-1, cv=skf.split(X,Y), verbose=3, random_state=1001)
model1_rs.fit(X, Y)
print(average_precision_score(y_test, model1_rs.best_estimator_.predict_proba(X_test)[:,1], average="micro"))
# 0.7302823843908489
print(average_precision_score(y_train, model1_rs.predict_proba(X_train)[:,1], average="micro"))
# 0.7827743151047564
# 2nd optimization
params = {
'min_child_weight': [1],
'gamma': [0],
'subsample': [0.8],  # model1_rs.best_params_ showed 0.8 was best
'colsample_bytree': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'max_depth': [3, 4, 5, 6],
'n_estimators': [600]
}
model2= XGBClassifier(learning_rate=0.1, objective='binary:logistic', scale_pos_weight=1, seed=27,
silent=True, nthread=4)
param_comb = 32  # there are 8*4 = 32 combinations in the parameter space, so the whole grid is searched
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)
model2_rs = RandomizedSearchCV(model2, param_distributions=params, n_iter=param_comb, scoring='average_precision', n_jobs=-1, cv=skf.split(X,Y), verbose=3, random_state=1001)
model2_rs.fit(X, Y)
print(average_precision_score(y_test, model2_rs.best_estimator_.predict_proba(X_test)[:,1], average="micro"))
# 0.7302083383917825
print(average_precision_score(y_train, model2_rs.predict_proba(X_train)[:,1], average="micro"))
# 0.7741629881312448
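On possibility 1) again: holding the model and data fixed and changing only the seed of the CV splitter isolates how much PR-AUC moves with the split alone. A minimal self-contained sketch (again with GradientBoostingClassifier and synthetic data standing in for the real setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, Y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = GradientBoostingClassifier(random_state=27)

# Same model, same data -- only the CV split seed changes
scores = []
for seed in (0, 1, 2):
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    s = cross_val_score(clf, X, Y, scoring='average_precision', cv=skf)
    scores.append(s.mean())

print(scores)
print(max(scores) - min(scores))  # spread here is pure split noise
```

If this spread dwarfs the 0.0001 test-set difference between the two optimizations above, the ranking of the two "best" models cannot be trusted at n_splits=3; increasing the fold count or repeating CV with several seeds would give a more stable comparison.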