R caret: values of $finalModel$predicted and values obtained by predict()

Date: 2019-01-07 13:56:08

Tags: r prediction r-caret

To illustrate the differences between $finalModel$predicted and the values computed by predict(), I set up the following code:

library(caret)
library(randomForest)

dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
                  A = c(1.3, 4.4, 5.5, 6.7, 8.1),
                  B = c(44.5, 50.1, 23.7, 89.2, 10.5),
                  # explicit factor: R >= 4.0 no longer converts strings automatically
                  C = factor(c("A", "A", "B", "B", "B")))

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        search = "grid", savePredictions = TRUE)

tunegrid <- expand.grid(.mtry=c(1:3))

set.seed(42)
rf_gridsearch <- train(target ~ A + B + C, 
                   data = dat, 
                   method="rf",
                   ntree = 2500, 
                   metric= "RMSE", 
                   tuneGrid=tunegrid, 
                   trControl=control)

dat$pred_caret <- rf_gridsearch$finalModel$predicted

dat$pred <- predict(object = rf_gridsearch, newdata = dat[,2:4])
dat$pred2 <- predict(object = rf_gridsearch$finalModel, newdata = dat[,2:4])

The last line of this code fails with the error message:

Error in predict.randomForest(object = rf_gridsearch$finalModel, 
newdata = dat[,  : variables in the training data missing in newdata

How is it possible to use $finalModel with predict?

Why do the values in column dat$pred_caret differ from those in dat$pred? What is the difference between the two predictions?

1 answer:

Answer 0 (score: 1)

There are already quite a few questions related to this issue: see Question 1, Question 2, Question 3, Question 4 and Question 5 on SO and Stats.SE.


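As several answers on Stats.SE explain, dat$pred_caret and dat$pred differ because predict.train runs the whole training set through every tree of the final forest, whereas rf_gridsearch$finalModel$predicted holds the out-of-bag predictions. From the documentation of predict.randomForest:

newdata - a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)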

Here rf_gridsearch$finalModel$predicted is essentially the same as
randomForest:::predict.randomForest(rf_gridsearch$finalModel)

because rf_gridsearch$finalModel is an object of class randomForest and this call supplies no newdata.
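To make the out-of-bag point concrete, here is a minimal sketch that uses randomForest directly, without caret. The data frame reuses the numeric columns from the question; the object name rf and the choice of ntree = 500 are assumptions made only for this illustration.

```r
library(randomForest)

# Numeric columns from the question; the factor C is left out to keep the sketch small
dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
                  A = c(1.3, 4.4, 5.5, 6.7, 8.1),
                  B = c(44.5, 50.1, 23.7, 89.2, 10.5))

set.seed(42)
rf <- randomForest(target ~ A + B, data = dat, ntree = 500)

oob_stored  <- rf$predicted                # out-of-bag predictions stored at fit time
oob_predict <- predict(rf)                 # no newdata: returns the same OOB predictions
in_sample   <- predict(rf, newdata = dat)  # every tree predicts every row

# TRUE if the stored and the returned OOB predictions agree
isTRUE(all.equal(unname(oob_stored), unname(oob_predict)))

# Side-by-side view: the OOB and in-sample columns generally differ
cbind(oob = oob_stored, in_sample = in_sample)
```

Each row's OOB prediction averages only the trees whose bootstrap sample left that row out, which is why it is usually further from the observed target than the in-sample prediction.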

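As for the error, it comes from the fact that train and randomForest treat the data differently. This time it is not about scaling or centering but about creating dummy variables. Specifically, the formula interface of train expanded the factor C into the dummy variable CB <- 1 * (C == "B"), so the final randomForest model was fitted on the columns A, B and CB, while the newdata passed to it only contains the factor C. You can therefore reproduce the result of predict.train with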

predict(object = rf_gridsearch$finalModel, 
        newdata = model.matrix(~ A + B + C, dat[, 2:4])[, -1])

where

model.matrix(~ A + B + C, dat[, 2:4])[, -1]
#     A    B CB
# 1 1.3 44.5  0
# 2 4.4 50.1  0
# 3 5.5 23.7  1
# 4 6.7 89.2  1
# 5 8.1 10.5  1
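As a quick check of the dummy-variable explanation, the sketch below compares the two prediction routes on the question's data. To keep it fast it fits a single forest with trainControl(method = "none") and a fixed mtry = 2; both choices, and the names fit, x_dummy, p_train and p_final, are assumptions for this illustration and not part of the original setup.

```r
library(caret)
library(randomForest)

dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
                  A = c(1.3, 4.4, 5.5, 6.7, 8.1),
                  B = c(44.5, 50.1, 23.7, 89.2, 10.5),
                  C = factor(c("A", "A", "B", "B", "B")))

set.seed(42)
# No resampling: fit one forest with a fixed mtry (simplification for the sketch)
fit <- train(target ~ A + B + C,
             data = dat,
             method = "rf",
             ntree = 500,
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "none"))

# Dummy-coded predictors, intercept column dropped: columns A, B, CB
x_dummy <- model.matrix(~ A + B + C, dat)[, -1]

p_train <- predict(fit, newdata = dat)                 # caret builds the dummies itself
p_final <- predict(fit$finalModel, newdata = x_dummy)  # dummies supplied by hand

# TRUE if both routes give the same predictions
isTRUE(all.equal(unname(p_train), unname(p_final)))
```

In practice calling predict on the train object is the safer route, since it reapplies the same formula expansion (and any preprocessing) that was used during fitting.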