Question

I am trying to use XGBoost cross-validation for parameter tuning, as shown here: https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f. The difference is that mine is a binary classification problem.

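The problem is that it appears to be overfitting, which I thought cross-validation would prevent. The more I increase max_depth, the "better" my AUC gets. I have already tried values from 6 to 14 (in steps of 2). dtrain has over 8 million samples and 20 features.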

Here is my code snippet:

import xgboost as xgb

# dtrain is an xgb.DMatrix built beforehand (not shown here)
num_boost_round = 25

params = {
        #parameters to tune   #defaults
        'max_depth':12,        #6
        'min_child_weight':3, #1
        'eta':0.3,            #0.3
        'subsample':1,        #1
        'colsample_bytree':1, #1
        #other parameters
        'objective':'binary:logistic',
}
params['eval_metric'] = 'auc'

gridsearch_params = [
      (max_depth, min_child_weight)
      for max_depth in range(16,23,2)
      for min_child_weight in range(3,4,1)
]

for max_depth, min_child_weight in gridsearch_params:
      print("CV with max_depth={}, min_child_weight={}".format(max_depth, min_child_weight))

      #update our params
      params['max_depth'] = max_depth
      params['min_child_weight'] = min_child_weight

      #Run CV
      cv_results = xgb.cv(params,
                    dtrain,
                    num_boost_round,
                    nfold=5,
                    early_stopping_rounds=10,
                    metrics="auc", 
                    maximize=True,
                    as_pandas=True,
                    seed=123)

      mean_auc = cv_results['test-auc-mean'].max()
      boost_rounds = cv_results['test-auc-mean'].argmax()
      print("AUC {} for {} rounds".format(mean_auc, boost_rounds))

And some of the output:

CV with max_depth=16, min_child_weight=3
AUC 0.9779078 for 24 rounds
CV with max_depth=18, min_child_weight=3
AUC 0.9856856 for 24 rounds
CV with max_depth=20, min_child_weight=3
AUC 0.9898591999999999 for 24 rounds
CV with max_depth=22, min_child_weight=3
AUC 0.991723 for 24 rounds
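
A note on reading the "rounds" figure above, as a minimal sketch. The per-round AUC values below are made up for illustration and are not my real results. The point is only that argmax() returns a zero-based position, so with num_boost_round = 25 a value of 24 refers to the 25th and final round, meaning the best mean test AUC was reached at the last round that was run.

```python
# Hypothetical per-round mean test AUC values that improve every round
# (illustrative numbers only, not taken from the real run).
num_boost_round = 25
test_auc_mean = [0.90 + 0.003 * i for i in range(num_boost_round)]

# Same idea as Series.argmax(): zero-based position of the best score.
best_index = max(range(len(test_auc_mean)), key=test_auc_mean.__getitem__)

# The number of boosting rounds actually used is position + 1.
best_rounds = best_index + 1

print("best index:", best_index, "-> rounds used:", best_rounds)
```

If the best score had been found earlier than the last round, best_index would be smaller than num_boost_round - 1.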

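Using the XGBoost train method with an 80%/20% train/test split, I get an AUC of 0.86, which seems about right for this problem. So I'm not sure what is going on here. I feel like I'm missing something very simple. Is my xgb.cv command correct? Does anything look off with the parameters?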

XGBoost cv function overfitting

0 answers: