I am using xgboost to build a model. The dataset has only 200 rows and 10,000 columns.
I tried using chi-squared to select 100 columns, but my confusion matrix looks like this:
       1     0
1    190     0
0     10     0
I have tried using all 10,000 attributes, 100 randomly selected attributes, and the 100 attributes selected by chi-squared, but I never get any predictions for class 0. Is it because of the dataset, or because of the way I am using xgboost?
My factor(pred.cv) always shows only 1 level, while factor(y + 1) has levels 1 and 2.
param <- list("objective" = "binary:logistic",
"eval_metric" = "error",
"nthread" = 2,
"max_depth" = 5,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 0.8,
"colsample_bytree" = 0.8,
"min_child_weight" = 1,
"max_delta_step"= 5,
"learning_rate" =0.1,
"n_estimators" = 1000,
"seed"=27,
"scale_pos_weight" = 1
)
nfold=3
nrounds=200
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1) # this is the target in train, level 1 and 2
factor(pred.cv) # this is the issue, it is always only 1 level
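(The post does not show how bst.cv was created; presumably it came from an xgb.cv call with prediction = TRUE, roughly like the sketch below. This is an assumed reconstruction, not part of the original code, and dtrain is an assumed xgb.DMatrix built from the selected columns and the 0/1 labels y.)

library(xgboost)
# Assumed call that would produce bst.cv with out-of-fold predictions in bst.cv$pred
bst.cv <- xgb.cv(params = param, data = dtrain,
                 nfold = nfold, nrounds = nrounds,
                 prediction = TRUE)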
Answer 0 (score: 0)
I have found caret to be slow, and it cannot tune all of an xgboost model's parameters without building a custom model, which is far more complicated than doing the evaluation with a custom function.
However, if you are doing some up/down-sampling or SMOTE/ROSE, caret is the way to go, because it incorporates them correctly during the model evaluation phase (i.e. inside resampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
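For illustration, a minimal caret sketch of that subsampling setup (my own example, not from the linked page; it assumes a feature matrix X, a two-level factor outcome y_factor, and that the package backing "smote" is installed):

library(caret)
# trainControl's sampling argument applies up/down/SMOTE/ROSE inside each resample,
# so the resampled performance estimates stay honest.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote")            # or "up", "down", "rose"
fit <- train(x = X, y = y_factor,                   # X, y_factor are assumed names
             method = "xgbTree",
             metric = "ROC",
             trControl = ctrl)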
That said, I have found these techniques to have very little effect on the results, and often they make things worse, at least in the models I have trained.
scale_pos_weight gives a higher weight to one of the classes; if the minority class is at about 10% abundance, playing with scale_pos_weight in the 5 - 10 range should be beneficial.
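A common heuristic (my own sketch, assuming y is the 0/1 label vector and class 1 is the positive class) is to set it to the ratio of negative to positive cases:

# sum of negatives over sum of positives; around 5-10 when the positive class is ~10-20% of the data
param$scale_pos_weight <- sum(y == 0) / sum(y == 1)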
Tuning the regularization parameters can be very useful for xgboost: here you have alpha, lambda and gamma - I have found values of 0 - 3 to work well. Other useful parameters that add direct regularization (by adding randomness) are subsample, colsample_bytree and colsample_bylevel. I have found that playing with colsample_bylevel can also have a positive effect on the model. You are already using subsample and colsample_bytree.
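As a rough sketch (illustrative starting values only, not tuned results), those knobs would look like this in a parameter list:

param_reg <- list(objective = "binary:logistic",
                  eta = 0.1,
                  alpha = 1,                # L1 regularization
                  lambda = 1,               # L2 regularization
                  gamma = 1,                # minimum loss reduction required to split
                  subsample = 0.8,          # row subsampling per tree
                  colsample_bytree = 0.8,   # column subsampling per tree
                  colsample_bylevel = 0.8)  # column subsampling per tree level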
I would also test a smaller eta with more trees to see whether the model benefits. In that case, early_stopping_rounds can speed up the process.
A different eval_metric may also be more beneficial than accuracy. Try logloss or auc, or even map and ndcg.
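Putting the last two points together, a sketch (my own, assuming dtrain is an xgb.DMatrix) with a small eta, many rounds, early stopping and auc instead of error could look like this:

cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",   # or "logloss"
                           eta = 0.01,
                           max_depth = 5),
             data = dtrain,
             nrounds = 5000,
             nfold = 5,
             early_stopping_rounds = 50,
             verbose = 0)
cv$best_iteration   # number of rounds actually needed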
Here is a function for a hyperparameter grid search. It uses auc as the evaluation metric, but that can easily be changed.
xgb.par.opt = function(train, seed){
  require(xgboost)
  ntrees = 2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
    # Extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    # Class-imbalance weight computed from the labels stored in the DMatrix
    labels <- getinfo(train, "label")
    set.seed(seed)
    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             scale_pos_weight = sum(labels == 0) / sum(labels == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate,
                             seed = seed)
    # Keep the test auc (mean and sd) at the best iteration, together with the tested settings
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    auc <- xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc <- cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) <- c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
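A hedged usage sketch (the object names are my own assumptions): build an xgb.DMatrix from the training matrix and labels and pass it in.

dtrain  <- xgb.DMatrix(data = as.matrix(X_train), label = y)  # X_train, y are assumed names
results <- xgb.par.opt(dtrain, seed = 27)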
Other parameters can be added to the expand.grid call.
I usually tune the hyperparameters on one CV repetition and evaluate them on additional repetitions with other seeds, or on a validation set (but a validation set should be used with care to avoid overfitting).
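For instance (a sketch of that idea, seed values arbitrary), the search can be repeated with a few seeds and the selected settings compared:

res_by_seed <- lapply(c(27, 101, 2027), function(s) xgb.par.opt(dtrain, seed = s))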