xgboost always predicts class 1 on an imbalanced dataset

Asked: 2017-10-24 10:28:02

Tags: r, xgboost

I am using xgboost to build a model. The dataset has only 200 rows and 10,000 columns.

I tried selecting 100 columns with a chi-squared test, but my confusion matrix looks like this:

         1    0
    1  190    0
    0   10    0

I tried using all 10,000 attributes, 100 randomly chosen attributes, and the 100 attributes selected by chi-squared, but I never get a single case predicted as class 0. Is it the dataset, or the way I am using xgboost?

My factor(pred.cv) always shows only one level, while factor(y + 1) has levels 1 and 2.

param <- list("objective" = "binary:logistic",
          "eval_metric" = "error",
          "nthread" = 2,
          "max_depth" = 5,
          "eta" = 0.3,
          "gamma" = 0,
          "subsample" = 0.8,
          "colsample_bytree" = 0.8,
          "min_child_weight" = 1,
          "max_delta_step"= 5,
          "learning_rate" =0.1,
          "n_estimators" = 1000,
          "seed"=27,
          "scale_pos_weight" = 1
          )
nfold=3
nrounds=200
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1) # this is the target in train, level 1 and 2
factor(pred.cv) # this is the issue, it is always only 1 level
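
For context, bst.cv is not defined in the snippet above; it presumably comes from a cross-validation call roughly like the following, with prediction = TRUE so that bst.cv$pred exists (x and y are assumed to be the feature matrix and the 0/1 target):

# hypothetical reconstruction of the missing xgb.cv call
library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(x), label = y)
bst.cv <- xgb.cv(params = param, data = dtrain,
                 nfold = nfold, nrounds = nrounds,
                 prediction = TRUE)   # keep out-of-fold predictions in bst.cv$pred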

2 Answers:

Answer 0 (score: 0)

I find caret slow, and without building a custom model it cannot tune all of xgboost's parameters, while a custom model is far more involved than evaluating with a custom function.

However, if you are doing some up/down-sampling or SMOTE/ROSE, caret is the way to go, because it incorporates them correctly into the model evaluation phase (during resampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
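
For illustration, a minimal sketch of how caret wires subsampling into resampling, assuming a data frame df whose outcome y is a two-level factor (all names here are hypothetical):

library(caret)

# sampling can be "up", "down", "smote" or "rose"; caret applies it inside each
# resampling iteration, so the held-out folds are left untouched.
# The outcome must be a factor with valid level names (e.g. "yes"/"no") for classProbs.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "up")

fit <- train(y ~ ., data = df,
             method = "xgbTree",
             metric = "ROC",
             trControl = ctrl)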

That said, I have found these techniques have very little effect on the results, and often make them worse, at least with the models I have trained.

scale_pos_weight gives a higher weight to one of the classes; if the minority class is present at 10% abundance, playing with scale_pos_weight values around 5 - 10 should be beneficial.
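
As a rough sketch, the usual heuristic is the ratio of negative to positive cases in the training labels (y is assumed to be the 0/1 target vector):

# with roughly 10% positives this gives a weight of about 9
param$scale_pos_weight <- sum(y == 0) / sum(y == 1)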

Tuning the regularization parameters can be very useful for xgboost: here you have alpha, beta and gamma - I find valid values to be 0 - 3. Other useful parameters that add direct regularization (by adding randomness) are subsample, colsample_bytree and colsample_bylevel. I found that playing with colsample_bylevel can also have a positive effect on the model. You are already using subsample and colsample_bytree.
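
A minimal sketch of adding these knobs to the parameter list; the values are only starting points, and alpha/lambda are the L1/L2 regularization terms the R interface actually exposes:

param$alpha             <- 1    # L1 regularization on leaf weights
param$lambda            <- 1    # L2 regularization on leaf weights
param$gamma             <- 1    # minimum loss reduction required to make a split
param$colsample_bylevel <- 0.8  # feature subsampling at each tree level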

I would also test a smaller eta with more trees to see whether the model benefits. early_stopping_rounds can speed up the process in that case.

Other eval_metric choices are likely more useful than accuracy. Try logloss or auc, or even map and ndcg.
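
Putting the last two suggestions together, a hedged sketch of a cross-validation call with a small eta, many rounds, early stopping, and auc as the metric (dtrain is assumed to be an xgb.DMatrix built from the training data):

set.seed(27)
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",
                           eta = 0.01,
                           max_depth = 5,
                           subsample = 0.8,
                           colsample_bytree = 0.8),
             data = dtrain,
             nrounds = 2000,
             nfold = 3,
             early_stopping_rounds = 50,   # stop once test auc stops improving
             prediction = TRUE)
cv$best_iteration                          # number of rounds actually needed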

Here is a function I use for hyperparameter grid search. It uses auc as the evaluation metric, but that is easily changed.

xgb.par.opt=function(train, seed){
  require(xgboost)
  ntrees=2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1), 
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4,6,8,10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){

    #Extract Parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta =parameterList[["eta"]]
    currentMaxDepth =parameterList[["max_depth"]]
    set.seed(seed)

    xgboostModelCV <- xgb.cv(data = train, 
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric= "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = T,
                             # note: this line assumes `train` can also index rows of a
                             # global all_data whose first column holds the 0/1 label
                             scale_pos_weight = sum(all_data[train, 1] == 0) / sum(all_data[train, 1] == 1),
                             max_depth = currentMaxDepth, 
                             eta = currentEta, 
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate,
                             seed = seed)


    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)

    auc = xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
    auc = cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta,  currentMaxDepth)
    names(auc) = c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}

Other parameters can be added in the expand.grid call.

I usually tune the hyperparameters on one CV repetition and evaluate them on additional repetitions with other seeds, or on a validation set (but a validation set should be used with care to avoid overfitting).
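
For completeness, a usage sketch of that workflow (dtrain and the seed values are hypothetical):

# `dtrain` stands for whatever is passed as the function's `train` argument
# (it goes straight to xgb.cv's data argument)
grid_res <- xgb.par.opt(dtrain, seed = 27)

# re-check the most promising combinations with a different seed
# before committing to them
grid_res_check <- xgb.par.opt(dtrain, seed = 123)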

Answer 1 (score: 0)

Test:

param <- list("objective" = "binary:logistic",
      "eval_metric" = "error",
      "nthread" = 2,
      "max_depth" = 5,
      "eta" = 0.3,
      "gamma" = 0,
      "subsample" = 0.8,
      "colsample_bytree" = 0.8,
      "min_child_weight" = 1,
      "max_delta_step"= 5,
      "learning_rate" =0.1,
      "n_estimators" = 1000,
      "seed"=27,
      "scale_pos_weight" = 1
      )
nfold=3
nrounds=200
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1) # this is the target in train, level 1 and 2
factor(pred.cv) # this is the issue, it is always only 1 level