Question

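I am trying to combine 3 models into an ensemble model:

Model 1 - XGBoost
Model 2 - RandomForest
Model 3 - Logistic regression

Note: all of the code here uses the caret package's train() function.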

> Bayes_model

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75305, 75305, 75306, 75305, 75306, 75307, ... 
Resampling results:

  ROC        Sens  Spec
  0.5831236  1     0   

>linear_cv_model

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75306, 75305, 75305, 75306, 75306, 75305, ... 
Resampling results:

  ROC        Sens  Spec
  0.5776342  1     0   

>rf_model_best

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75305, 75305, 75306, 75305, 75306, 75307, ... 
Resampling results:

  ROC        Sens  Spec
  0.5551996  1     0

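Individually, the 3 models have very poor AUCs, in the 0.55-0.60 range, but they are not highly correlated, so I was hoping to ensemble them. Here is the basic code in R: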

Bayes_pred = predict(Bayes_model,train,type="prob")[,2]
linear_pred = predict(linear_cv_model,train,type="prob")[,2]
rf_pred = predict(rf_model_best,train,type="prob")[,2]
stacked = cbind(Bayes_pred,linear_pred,rf_pred,train[,"target"])

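So this produces a data frame with 4 columns: the three model predictions and the target. I thought the idea was now to run another meta-model on these three predictors, but when I do so, I get an AUC of 1 no matter which combination of XGBoost hyperparameters I try, so I know something is wrong.

Is this setup conceptually incorrect?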

meta_model = train(target~ ., data = stacked,
               method = "xgbTree",
               metric = "ROC",
               trControl = trainControl(method = "cv",number = 10,classProbs = TRUE,
                                        summaryFunction = twoClassSummary
                                        ),
               na.action=na.pass,
               tuneGrid = grid
               )

Results:

>meta_model

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 75306, 75306, 75307, 75305, 75306, 75305, ... 
Resampling results:

  ROC  Sens  Spec
  1    1     1

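I feel that a perfect AUC across the CV folds is a sure sign of a data error. When I try logistic regression as the meta-model, I also get perfect separation. It just doesn't make sense.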

> summary(stacked)
   Bayes_pred       linear_pred         rf_pred        Target
 Min.   :0.01867   Min.   :0.02679   Min.   :0.00000   No :74869  
 1st Qu.:0.08492   1st Qu.:0.08624   1st Qu.:0.01587   Yes: 8804  
 Median :0.10297   Median :0.10339   Median :0.04762              
 Mean   :0.10520   Mean   :0.10522   Mean   :0.11076              
 3rd Qu.:0.12312   3rd Qu.:0.12230   3rd Qu.:0.07937              
 Max.   :0.50483   Max.   :0.25703   Max.   :0.88889

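I know this isn't reproducible code, but I think this is a problem that doesn't depend on the data set. As shown above, I have three predictions that are not identical and that certainly don't have great AUC values individually. Combined, I should see some improvement, but not perfect separation.

EDIT: Using the very helpful suggestions from T. Scharf, here is how I grabbed the out-of-fold predictions to use in the meta-model. The predictions are stored in the model under "pred", but they are not in the original row order. You need to reorder them to stack correctly.

Using dplyr's arrange() function, this is how I got the predictions for the Bayes model: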

Bayes_pred = arrange(as.data.frame(Bayes_model$pred)[,c("Yes","rowIndex")],rowIndex)[,1]

In my case, "Bayes_model" is the caret train object and "Yes" is the target class that I am modeling.

Answer 1

Here is what is happening.

When you do this:

Bayes_pred = predict(Bayes_model,train,type="prob")[,2]
linear_pred = predict(linear_cv_model,train,type="prob")[,2]
rf_pred = predict(rf_model_best,train,type="prob")[,2]

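this is the problem.

You need out-of-fold predictions or test-set predictions as the inputs for training the meta-model.

You are currently using the models you trained together with the data you trained them on. This yields overly optimistic predictions, which you are now feeding to the meta-model as training data.

A good rule of thumb is to never call predict on data with a model that has already seen that data; nothing good can come of it.

Here is what you need to do:

When you train your initial 3 models, use method = "cv" and savePredictions = TRUE in trainControl(). This retains the out-of-fold predictions, which you can then use to train the meta-model.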
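As a minimal sketch of that setup (not the asker's actual models): the data frame `train_df`, its columns, and the choice of method = "glm" are assumptions made purely for illustration. It uses savePredictions = "final", which keeps only the held-out predictions for the selected tuning setting, and then restores the original row order via `rowIndex`.

```r
library(caret)

set.seed(42)
# Synthetic stand-in for the real training data (an assumption for illustration)
n <- 300
train_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train_df$target <- factor(
  ifelse(train_df$x1 + rnorm(n) > 0, "Yes", "No"),
  levels = c("No", "Yes")
)

ctrl <- trainControl(
  method = "cv", number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"  # keep held-out predictions for the best tuning setting
)

base_model <- train(target ~ ., data = train_df, method = "glm",
                    metric = "ROC", trControl = ctrl)

# Held-out predictions are stored grouped by fold; restore the original row order
oof <- base_model$pred[order(base_model$pred$rowIndex), ]
oof_pred <- oof$Yes  # out-of-fold probability of the "Yes" class, one per training row
```

Repeating this for each base model and binding the resulting `oof_pred` vectors together with the target gives a meta-model training set in which every prediction was made by a model that had not seen that row.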

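To convince yourself that your input data to the meta-model is wildly optimistic, calculate an individual AUC for each of the 3 prediction columns of this object: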

stacked = cbind(Bayes_pred,linear_pred,rf_pred,train[,"target"])

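versus the target. They will be really high, which is why your meta-model looks so good: it is being fed inputs that are too good to be true.

Hope this helps, meta-modeling is hard...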
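The same effect can be reproduced on a toy problem. The sketch below uses only base R and pure-noise features (all names and sizes here are illustrative assumptions, not the asker's data), so no model should genuinely do better than an AUC of about 0.5. Comparing the in-sample AUC with the out-of-fold AUC shows how much of the apparent signal comes from predicting rows the model was fitted on.

```r
set.seed(1)

# Rank-based AUC (equivalent to the Mann-Whitney U statistic); y is coded 0/1
auc <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Pure-noise features: the target is unrelated to every predictor
n <- 200
p <- 30
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
dat$y <- rbinom(n, 1, 0.5)

# In-sample: fit on all rows, then predict those same rows
fit_all <- glm(y ~ ., data = dat, family = binomial)
in_sample <- predict(fit_all, dat, type = "response")

# Out-of-fold: each row is predicted by a model that never saw it
folds <- sample(rep(1:5, length.out = n))
oof <- numeric(n)
for (k in 1:5) {
  held_out <- folds == k
  fit_k <- glm(y ~ ., data = dat[!held_out, ], family = binomial)
  oof[held_out] <- predict(fit_k, dat[held_out, ], type = "response")
}

auc_in <- auc(in_sample, dat$y)   # inflated by overfitting
auc_oof <- auc(oof, dat$y)        # honest estimate
```

A flexible learner such as a random forest can memorize its training rows far more thoroughly than this logistic regression, which is consistent with the meta-model in the question reaching an AUC of 1.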

Ensemble model predicts AUC of 1
