R-Caret: How to build a more efficient model from multiple models and predict new results

Time: 2015-03-19 11:24:48

Tags: r machine-learning r-caret

My training data set (train) is a data frame with n features plus one more column holding the outcome y. I built 3 individual models, for example:

m1 <- train(y ~ ., data = train, method = "lda")
m2 <- train(y ~ ., data = train, method = "rf")
m3 <- train(y ~ ., data = train, method = "gbm")

Using the test data set (test), which of course includes the outcome y, I can evaluate the quality of these individual models:

pred1 <- predict(m1, newdata = test)
pred2 <- predict(m2, newdata = test)
pred3 <- predict(m3, newdata = test)
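
As a minimal sketch of what that evaluation could look like, the snippet below compares a vector of predicted classes with the known outcomes using base R only. The vectors y_true and pred are hypothetical stand-ins for test$y and pred1; they are not taken from the question's data.

```r
# Hypothetical toy data standing in for test$y and predict(m1, newdata = test)
y_true <- factor(c("a", "b", "a", "b", "a"), levels = c("a", "b"))
pred   <- factor(c("a", "b", "b", "b", "a"), levels = c("a", "b"))

# Accuracy: share of predictions that match the known outcome
accuracy <- mean(pred == y_true)

# Cross-tabulate predictions against the known outcomes
conf_table <- table(predicted = pred, actual = y_true)

print(accuracy)
print(conf_table)
```

With caret loaded, confusionMatrix(pred1, test$y) reports the same kind of summary in more detail.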

If I apply each individual model to a data frame DATA_TO_PREDICT containing 5 examples (outcome unknown), the output is naturally 5 predictions per model:

predict(m1, DATA_TO_PREDICT)
predict(m2, DATA_TO_PREDICT)
predict(m3, DATA_TO_PREDICT)

Now I want to build a combined model with random forest, using the R caret package:

DF <- data.frame(pred1, pred2, pred3, y = test$y)
MODEL <- train(y ~ ., data = DF, method = "rf")

I can observe that the accuracy of the combined model increases:

predMODEL <- predict(MODEL, DF)

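However, if I apply the combined model to DATA_TO_PREDICT (outcome unknown), the output is not just 5 predictions but a huge list of more than 100 values with repeated results. I used: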

predict(MODEL, newdata = DATA_TO_PREDICT)

Example

Here is a concrete example that reproduces the wrong output. I want predictions for 4 new observations, but I get dozens of results:

library(caret)
library(gbm)
set.seed(10)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

inTEST <- (5:nrow(testing))
test <- testing[inTEST,]
DATA_TO_PREDICT <- testing[-inTEST,]

m1 <- train(diagnosis ~ ., data=training, method="rf")
m2 <- train(diagnosis ~ ., data=training, method="gbm")
m3 <- train(diagnosis ~ ., data=training, method="lda")
p1 <- predict(m1, newdata = test)
p2 <- predict(m2, newdata = test)
p3 <- predict(m3, newdata = test)

DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
predMODEL <- predict(MODEL, DF)

Then, when I apply the combined model to the new data:

pred1 <- predict(m1, DATA_TO_PREDICT)
pred2 <- predict(m2, DATA_TO_PREDICT)
pred3 <- predict(m3, DATA_TO_PREDICT)
DF2 <- data.frame(pred1, pred2, pred3)
predict(MODEL, newdata = DF2) 

Note that DATA_TO_PREDICT has only 4 examples, yet the output is:

  [1] Control Control Control Control Control Control Control Control
  [9] Control Control Control Control Control Control Control Control
 [17] Control Control Control Control Control Control Control Control
 [25] Control Control Control Control Control Control Control Control
 [33] Control Control Control Control Control Control Control Control
 [41] Control Control Control Control Control Control Control Control
 [49] Control Control Control Control Control Control Control Control
 [57] Control Control Control Control Control Control Control Control
 [65] Control Control Control Control Control Control Control Control
 [73] Control Control Control Control Control Control
 Levels: Impaired Control

1 Answer:

Answer 0 (score: 2)

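This happens because MODEL was trained on the predictions of the three individual models (pred1, pred2 and pred3, computed on the test data), whereas in the last step DATA_TO_PREDICT, which consists of the raw observations, is passed to MODEL directly. The individual models' predictions for DATA_TO_PREDICT must be stored first and then used as the newdata for MODEL.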
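
The point can be illustrated with a small self-contained sketch. It uses lm from base R instead of caret so that it runs without extra packages, and all data and names (dat, train_set, test_set, new_data) are made up for illustration. It also shows a likely explanation for the output in the question's worked example: there the stacked model was trained on columns p1, p2, p3, but DF2 names them pred1, pred2, pred3. When a formula variable is missing from newdata, R's model frame machinery falls back to the environment of the formula, where the test-set vectors p1, p2, p3 still exist, so one prediction per test row comes back.

```r
set.seed(1)

# Made-up regression data: two features and an outcome
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.1)

train_set <- dat[1:30, ]
test_set  <- dat[31:56, ]                 # 26 rows with known outcome
new_data  <- dat[57:60, c("x1", "x2")]    # 4 rows, outcome treated as unknown

# Two base models (stand-ins for the caret models m1, m2, m3)
m1 <- lm(y ~ x1, data = train_set)
m2 <- lm(y ~ x2, data = train_set)

# Stacked model trained on the base models' test-set predictions
p1 <- predict(m1, newdata = test_set)
p2 <- predict(m2, newdata = test_set)
DF <- data.frame(p1, p2, y = test_set$y)
MODEL <- lm(y ~ ., data = DF)

# Mismatched column names: p1 and p2 are not found in newdata, so the
# test-set vectors p1 and p2 from the enclosing environment are used instead
DF2_bad <- data.frame(pred1 = predict(m1, newdata = new_data),
                      pred2 = predict(m2, newdata = new_data))
bad <- suppressWarnings(predict(MODEL, newdata = DF2_bad))
length(bad) == nrow(test_set)   # one value per test row, not per new row

# Matching column names: one prediction per row of new_data
DF2_ok <- data.frame(p1 = predict(m1, newdata = new_data),
                     p2 = predict(m2, newdata = new_data))
ok <- predict(MODEL, newdata = DF2_ok)
length(ok) == nrow(new_data)
```

Applied to the question's example, this would mean building DF2 as data.frame(p1 = pred1, p2 = pred2, p3 = pred3) before calling predict(MODEL, newdata = DF2).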
