Question

I am running a randomForest model in R on sample validation data:

predictions <- predict(rf, newdata = model_final, type = "prob")

Evidently there is a new factor level somewhere, which produces this message:

Error in predict.randomForest(rf, newdata = model_final, type = "prob") : 
  New factor levels not present in the training data

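Setting aside everything specific to my data and model, is there a way to force predict to report which columns contain the new factor levels? Or is there another quick, programmatic way to identify the offending columns?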

Answer 1

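Assuming the training and test sets have the same columns in the same order, a single mapply call is enough to identify where the factor levels differ: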

Sample data

training <- data.frame(a = as.factor(letters), b = letters, stringsAsFactors = FALSE)
test     <- data.frame(a = as.factor(rep(letters[1:20], 3)), b = rep(letters[1:20], 3), stringsAsFactors = FALSE)

Solution

> mapply(function(x,y) identical(levels(x), levels(y)), training, test )
    a     b 
FALSE  TRUE

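A FALSE value means that the factor levels of that column differ between the training and test sets. For numeric, logical, or character columns, levels() returns NULL for both data frames, so identical() returns TRUE and those columns are never flagged.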
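To get the flagged columns by name rather than reading them off the printed vector, the result of the mapply call can be used as a logical index. The sketch below reuses the sample data from this answer; `same_levels` and `bad_cols` are illustrative names, not part of any package.

```r
# Sample data from the answer: column a is a factor, column b is character.
training <- data.frame(a = as.factor(letters), b = letters,
                       stringsAsFactors = FALSE)
test <- data.frame(a = as.factor(rep(letters[1:20], 3)),
                   b = rep(letters[1:20], 3),
                   stringsAsFactors = FALSE)

# TRUE where both columns have identical level sets, FALSE otherwise.
same_levels <- mapply(function(x, y) identical(levels(x), levels(y)),
                      training, test)

# Names of the columns whose levels differ.
bad_cols <- names(same_levels)[!same_levels]
print(bad_cols)

# Non-factor columns have no levels, so both sides are NULL and compare equal.
print(is.null(levels(training$b)))
```

This still relies on both data frames having the same columns in the same order, as the answer assumes.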

If I have understood your question correctly, you just need to look at the columns for which the mapply call above returns FALSE.
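One caveat: identical() flags any difference in levels, including the case in the sample data where the test set merely lacks some of the training levels. If the goal is to find only levels that appear in the test set but not in the training set, a set difference is a narrower check. The following is a minimal sketch; `find_new_levels` is a hypothetical helper, not a randomForest function, and it assumes columns can be matched by name and that comparing levels() is what matters for the error.

```r
# Hypothetical helper: for every column that is a factor in both data
# frames (matched by name), return the levels present in `test` but
# absent from `training`.
find_new_levels <- function(training, test) {
  shared <- intersect(names(training), names(test))
  both_factor <- vapply(shared, function(col) {
    is.factor(training[[col]]) && is.factor(test[[col]])
  }, logical(1))
  factor_cols <- shared[both_factor]

  new_levels <- lapply(factor_cols, function(col) {
    setdiff(levels(test[[col]]), levels(training[[col]]))
  })
  names(new_levels) <- factor_cols

  # Keep only the columns that actually contain unseen levels.
  new_levels[lengths(new_levels) > 0]
}

# Illustrative data: here the test set has levels the training set lacks.
training <- data.frame(a = factor(letters[1:20]), b = letters[1:20],
                       stringsAsFactors = FALSE)
test <- data.frame(a = factor(letters), b = letters,
                   stringsAsFactors = FALSE)

res <- find_new_levels(training, test)
print(res)
```

The result is a named list with one entry per offending column, so both the column and the specific unseen levels are visible at once.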
