To illustrate the differences between $finalModel$predicted and the values computed by predict(), I set up the following code:
library(caret)
library(randomForest)
dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
                  A = c(1.3, 4.4, 5.5, 6.7, 8.1),
                  B = c(44.5, 50.1, 23.7, 89.2, 10.5),
                  C = c("A", "A", "B", "B", "B"))
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        search = "grid", savePredictions = TRUE)
tunegrid <- expand.grid(.mtry = 1:3)
set.seed(42)
rf_gridsearch <- train(target ~ A + B + C,
                       data = dat,
                       method = "rf",
                       ntree = 2500,
                       metric = "RMSE",
                       tuneGrid = tunegrid,
                       trControl = control)
dat$pred_caret <- rf_gridsearch$finalModel$predicted
dat$pred <- predict(object = rf_gridsearch, newdata = dat[,2:4])
dat$pred2 <- predict(object = rf_gridsearch$finalModel, newdata = dat[,2:4])
The last line of this code gives the error message:
Error in predict.randomForest(object = rf_gridsearch$finalModel,
newdata = dat[, : variables in the training data missing in newdata
How is it possible to use $finalModel with predict?
Why does the data in column dat$pred_caret differ from dat$pred? What is the difference between the 2 predictions?
Answer 0 (score: 1)
There are already many questions related to this one, both on SO and on Stats.SE (Question 1, Question 2, Question 3, Question 4, Question 5).
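As several answers on Stats.SE explain, dat$pred_caret differs from dat$pred because predict.train runs each row of the training set through the whole fitted forest, whereas predict.randomForest without newdata returns the out-of-bag predictions. From ?predict.randomForest:
newdata - a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)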
Here rf_gridsearch$finalModel$predicted is the same as
randomForest:::predict.randomForest(rf_gridsearch$finalModel)
because rf_gridsearch$finalModel is an object of class randomForest and no newdata is supplied in that call.
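The distinction between out-of-bag and full-forest predictions can be checked on a small model. The following is a minimal sketch, assuming randomForest is called directly (without caret) and only on the numeric predictors A and B, so that no dummy coding is involved; the object names rf, oob and full are illustrative only.

```r
library(randomForest)

# Same toy data as in the question, numeric predictors only (an assumption
# made here to keep the example free of factor handling).
dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
                  A = c(1.3, 4.4, 5.5, 6.7, 8.1),
                  B = c(44.5, 50.1, 23.7, 89.2, 10.5))

set.seed(42)
rf <- randomForest(target ~ A + B, data = dat, ntree = 500)

# No newdata: each row is predicted only by trees that did not see it.
oob <- predict(rf)

# With newdata: every tree contributes to every row's prediction.
full <- predict(rf, newdata = dat)

# The out-of-bag predictions are exactly what is stored in $predicted.
same_as_stored <- isTRUE(all.equal(oob, rf$predicted))
print(same_as_stored)
print(cbind(oob = oob, full = full))
```

Comparing the two printed columns shows how far the out-of-bag values sit from the in-sample ones for this data.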
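As for the error, it comes from the fact that train and randomForest treat the data differently. This time it is not about scaling or centering but about creating dummy variables. Specifically, train created the dummy variable CB <- 1 * (C == "B") and fitted the final model on it, so predict.randomForest looks for a column CB, while newdata only contains the factor C. Therefore, you can reproduce the result of predict.train with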
predict(object = rf_gridsearch$finalModel,
newdata = model.matrix(~ A + B + C, dat[, 2:4])[, -1])
where
model.matrix(~ A + B + C, dat[, 2:4])[, -1]
# A B CB
# 1 1.3 44.5 0
# 2 4.4 50.1 0
# 3 5.5 23.7 1
# 4 6.7 89.2 1
# 5 8.1 10.5 1
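The same mismatch can be reproduced without caret. The sketch below assumes that fitting randomForest on a model.matrix design matrix is a fair stand-in for what train does with a formula; it is not taken from caret's source, and the names x, rf, res and pred are illustrative only.

```r
library(randomForest)

dat <- data.frame(target = c(2.5, 4.5, 6.1, 3.2, 2.2),
                  A = c(1.3, 4.4, 5.5, 6.7, 8.1),
                  B = c(44.5, 50.1, 23.7, 89.2, 10.5),
                  C = c("A", "A", "B", "B", "B"))

# Dummy-code the predictors and drop the intercept column,
# leaving the columns A, B and CB.
x <- model.matrix(~ A + B + C, dat)[, -1]

set.seed(42)
rf <- randomForest(x = x, y = dat$target, ntree = 500)

# The raw data frame has a column C but no column CB, so predict fails.
res <- try(predict(rf, newdata = dat[, c("A", "B", "C")]), silent = TRUE)
print(inherits(res, "try-error"))

# Supplying the same dummy-coded columns works.
pred <- predict(rf, newdata = model.matrix(~ A + B + C, dat)[, -1])
print(pred)
```

In general, whatever is passed as newdata to a model fitted on a matrix needs the same column names as the training matrix.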