Question

我正试图在this dataset上运行5倍XGBoost模型。当我运行以下代码时：

  train_control<- trainControl(method="cv", 
                           search = "random", 
                           number=5,
                           verboseIter=TRUE)

  # Train Models 
  xgb.mod<- train(Vote_perc~.,
              data=forkfold, 
              trControl=train_control, 
              method="xgbTree", 
              family=binomial())

我收到警告：

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

此外，“预测”功能运行，但所有预测都是相同的数字。我怀疑这只是一个拦截模型，但我不确定。当我删除

时

search="random"

参数，它运行正常。我想运行随机搜索，以便我可以隔离哪些超参数可能最有效，但每次尝试时，我都会收到警告。我错过了什么？谢谢！

Answer 1

以下是您可以使用数据执行的一种方法：

加载数据：

forkfold  <- read.csv("forkfold.csv", row.names = 1)

这里的问题是结果变量在97％的情况下为0，而在剩余的3％中它非常接近于零。

length(forkfold$Vote_perc)
#output
7069

sum(forkfold$Vote_perc != 0)
#output 
212

您将其描述为分类问题，我会将其转换为二元问题：

forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
                             "one",
                             "zero")

由于该集合使用Accuracy高度不平衡，因为选择指标是不可能的。在这里，我将尝试通过定义自定义评估函数来最大化Sensitivity + Specificity here：

fourStats <- function (data, lev = levels(data$obs), model = NULL) {
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), 
                   ncol = 2, 
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}

我将在trainControl：

中指定此功能

train_control <- trainControl(method = "cv", 
                              search = "random", 
                              number = 5,
                              verboseIter=TRUE,
                              classProbs = T,
                              savePredictions = "final",
                              summaryFunction = fourStats)

set.seed(1)
xgb.mod <- train(Vote_perc~.,
                 data = forkfold, 
                 trControl = train_control, 
                 method = "xgbTree", 
                 tuneLength = 50,
                 metric = "Dist",
                 maximize = FALSE,
                 scale_pos_weight = sum(forkfold$Vote_perc == "zero")/sum(forkfold$Vote_perc == "one"))

我将在Dist摘要函数中使用之前定义的fourStats指标。应将此指标最小化，以便maximize = FALSE。我将在调谐空间上使用随机搜索，并将测试50个随机的超参数值集（tuneLength = 50）。

我还设置了xgboost函数的scale_pos_weight参数。在?xgboost的帮助下：

scale_pos_weight，[default = 1]控制正和平衡负权重，对非平衡类有用。一个典型的价值考虑：sum（负数情况）/ sum（正数情况）参见参数调整更多讨论。另见Higgs Kaggle比赛演示例子：R，py1，py2，py3

我将其定义为推荐sum(negative cases) / sum(positive cases)

在模型训练之后，它将选择一些最小化Dist的炒作参数。

评估保留预测中的混淆矩阵：

caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)

Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   195  430
      zero   17 6427

               Accuracy : 0.9368          
                 95% CI : (0.9308, 0.9423)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.4409          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.91981         
            Specificity : 0.93729         
         Pos Pred Value : 0.31200         
         Neg Pred Value : 0.99736         
             Prevalence : 0.02999         
         Detection Rate : 0.02759         
   Detection Prevalence : 0.08841         
      Balanced Accuracy : 0.92855         

       'Positive' Class : one

我说它没有那么糟糕。

如果您调整预测的截止阈值，则可以做得更好，在调整过程中如何执行此操作[{3}}。您还可以使用折叠预测来调整截止阈值。在这里，我将展示如何使用pROC库：

library(pROC)

plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
     print.thres = TRUE)

here

图片上显示的阈值最大化Sens + Spec：

使用此阈值评估折扣表现：

caret::confusionMatrix(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                       xgb.mod$pred$obs)
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   200  596
      zero   12 6261

               Accuracy : 0.914           
                 95% CI : (0.9072, 0.9204)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.3668          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.94340         
            Specificity : 0.91308         
         Pos Pred Value : 0.25126         
         Neg Pred Value : 0.99809         
             Prevalence : 0.02999         
         Detection Rate : 0.02829         
   Detection Prevalence : 0.11260         
      Balanced Accuracy : 0.92824         

       'Positive' Class : one

在212个非零实体中，您检测到200个。

为了更好地执行，您可以尝试预处理数据。或者使用更好的超参数搜索例程，例如用于的mlrMBO包。或者也许改变学习者（我怀疑你可以在这里顶尖xgboost）。

还有一点需要注意，如果获得高灵敏度可能不是最重要的，也许使用＆＃34; Kappa＆＃34;因为选择指标可能会提供更令人满意的模型。

最后一点，我们可以使用已经选择的参数检查模型的性能，使用默认的scale_pos_weight = 1：

set.seed(1)
xgb.mod2 <- train(Vote_perc~.,
                  data = forkfold, 
                  trControl = train_control, 
                  method = "xgbTree", 
                  tuneGrid = data.frame(nrounds = 498,
                                        max_depth = 3,
                                        eta = 0.008833468,
                                        gamma = 4.131242,
                                        colsample_bytree = 0.4233169,
                                        min_child_weight = 3,
                                        subsample = 0.6212512),
                  metric = "Dist",
                  maximize = FALSE,
                  scale_pos_weight = 1)

caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one    94   21
      zero  118 6836

               Accuracy : 0.9803          
                 95% CI : (0.9768, 0.9834)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 3.870e-08       

                  Kappa : 0.5658          
 Mcnemar's Test P-Value : 3.868e-16       

            Sensitivity : 0.44340         
            Specificity : 0.99694         
         Pos Pred Value : 0.81739         
         Neg Pred Value : 0.98303         
             Prevalence : 0.02999         
         Detection Rate : 0.01330         
   Detection Prevalence : 0.01627         
      Balanced Accuracy : 0.72017         

       'Positive' Class : one

在默认阈值0.5下更糟糕。

和最佳阈值：

plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
     print.thres = TRUE)

mlr

0.037与我们根据建议设置scale_pos_weight时获得的0.369相比。然而，在最佳阈值下，两种方法都会产生相同的预测。

{caret} xgTree：重采样性能指标中缺少值

1 个答案: