Question

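I'm running into an error with the randomForest R package. After splitting my data into training and test sets with caret, I get the following error when I try to predict: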

Error in predict.randomForest(randomForestFit, type = "response", newdata =testing$GEN) 
:number of variables in newdata does not match that in the training data

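I split the training and test sets from the exact same file. There are no N/A or missing values anywhere in the data. My full code is below, but I don't think the error is in it. I have no idea why this error occurs. Any ideas would be greatly appreciated!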

library(caret)
require(foreign)

set.seed(825)
data <- read.spss("C:/MODEL_SAMPLE.sav",use.value.labels=TRUE, to.data.frame = TRUE)
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
training <- data[inTraining, ]
testing <- data[-inTraining, ]


library(randomForest)
library(foreach)

start.time <- Sys.time()

randomForestFit <- foreach(ntree=rep(63, 8), .combine=combine, .packages='randomForest') %dopar%
                    randomForest(training[-201],
                                 training$GEN,
                                 mtry = 40,
                                 ntree=ntree,
                                 verbose = TRUE,
                                 importance = TRUE,
                                 keep.forest=TRUE,
                                 do.trace = TRUE)

randomForestFit

predict = predict(randomForestFit, type="response", newdata=testing$GEN)

stopCluster(cl)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Answer 1

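Without the data, it is hard for anyone to say exactly what the problem is.

Three suggestions:

First, check the data in the SPSS file for stray characters.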

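Second, check that the options in read.spss are set correctly, in particular reencode = NA and use.missings = to.data.frame. The latter controls whether values declared as user-defined missing in SPSS are converted to NA.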

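Third, use str(df) and summary(df, useNA = "ifany"), and make sure that the factor variables, including the response, really are factors. Apply as.numeric(as.character()) to the numeric data in the data frame; this produces NA values wherever the data contain expressions such as VALUE! or #NA.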

You could also export from SPSS to CSV and run the same checks again.
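
The checks in the third suggestion might look like the following minimal sketch. The data frame df and its columns x and GEN are hypothetical stand-ins for the imported SPSS data, not the poster's actual file.

```r
# Hypothetical stand-in for the imported data: a numeric column that was
# read as a factor because of stray text such as "#NA".
df <- data.frame(
  x   = factor(c("1.5", "2.0", "#NA", "4.25")),
  GEN = factor(c("a", "b", "a", "b"))
)

str(df)      # shows the type of every column
summary(df)  # per-column summaries

# Convert via character so that the factor labels, not the internal integer
# codes, become numbers. Entries that are not numeric become NA (R would
# normally warn about this; the warning is suppressed here).
df$x <- suppressWarnings(as.numeric(as.character(df$x)))

# Count missing values per column after the conversion.
na_counts <- colSums(is.na(df))
print(na_counts)

# Check that the response is a factor, as a classification forest expects.
is.factor(df$GEN)
```

Any column whose NA count is above zero after the conversion contained non-numeric text and is worth inspecting in the original file.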

Answer 2

The key is the following:

:number of variables in newdata does not match that in the training data

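So my guess is that the training and test data differ, particularly in their column names. Perhaps something goes wrong at this line?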

inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)

To understand the problem better, you may need to post three rows each of the training and test data sets (with column names!).
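
One way to test this guess is to compare the columns passed as newdata with the predictors used for training. The sketch below uses a small made-up data frame (the columns x1, x2 and GEN are illustrative) and a plain random split in place of createDataPartition. The commented predict call at the end assumes, as the question's training[-201] suggests, that GEN is column 201.

```r
# Made-up data standing in for the SPSS file; GEN is the response.
set.seed(825)
data <- data.frame(
  x1  = rnorm(20),
  x2  = rnorm(20),
  GEN = factor(rep(c("a", "b"), 10))
)

# Simple random 75/25 split (a stand-in for caret::createDataPartition).
inTraining <- sample(nrow(data), size = 15)
training <- data[inTraining, ]
testing  <- data[-inTraining, ]

# Show a few rows of each set together with their column names.
head(training, 3)
head(testing, 3)

# Predictors used for fitting: every column except the response.
predictors <- setdiff(names(training), "GEN")

# Splitting by rows keeps the columns identical in both sets.
same_columns <- identical(names(training), names(testing))

# testing$GEN is a single vector (the response itself), not a data frame
# holding the predictor columns, so it cannot match the training variables.
bad_newdata  <- testing$GEN
good_newdata <- testing[, predictors, drop = FALSE]

is.data.frame(bad_newdata)                   # FALSE
identical(names(good_newdata), predictors)   # TRUE

# With the question's objects, the analogous call would be (assumption:
# GEN is column 201, as in training[-201]):
# pred <- predict(randomForestFit, newdata = testing[-201], type = "response")
```

If same_columns is TRUE, the split itself is probably not the cause, and the object passed as newdata is the next thing to inspect.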

I hope this helps!

randomForest error when predicting on the test set
