New factor levels not present in the training data

Asked: 2013-06-27 20:11:47

Tags: r random-forest

When I try to use the output of randomForest to classify new data (or even the original training data), I get the following error:

> res.rf5 <- predict(model.rf5, train.rf5)
Error in predict.randomForest(model.rf5, train.rf5) :
  New factor levels not present in the training data

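What does this error mean? Why does it occur even when I try to predict on the same data that was used for training?

Below is a small example that reproduces the error.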

train.rf5 <- structure(
  list(A = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L),
                     .Label = c("(-0.1,19.9]", "(19.9,40]", "(80.1,100]"),
                     class = c("ordered", "factor")),
       B = structure(c(3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L),
                     .Label = c("1", "2", "4", "5"),
                     class = c("ordered", "factor")),
       C = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L),
                     .Label = c("FALSE", "TRUE"),
                     class = "factor")),
  .Names = c("A", "B", "C"),
  row.names = c(7L, 8L, 10L, 11L, 13L, 15L, 16L, 17L, 18L, 19L),
  class = "data.frame")

#              A B     C
# 7    (19.9,40] 4 FALSE
# 8  (-0.1,19.9] 1 FALSE
# 10 (-0.1,19.9] 1  TRUE
# 11 (-0.1,19.9] 1 FALSE
# 13 (-0.1,19.9] 1 FALSE
# 15 (-0.1,19.9] 1  TRUE
# 16  (80.1,100] 2  TRUE
# 17 (-0.1,19.9] 1 FALSE
# 18 (-0.1,19.9] 1 FALSE
# 19  (80.1,100] 5  TRUE

require(randomForest)
model.rf5 <- randomForest(C ~ ., data = train.rf5)
res.rf5 <- predict(model.rf5, train.rf5)  # Causes error

I have seen a few possibly related questions on SO, but I don't think they address my problem directly:

  1. dropping factor levels in a subsetted data frame in R
  2. Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?

Unlike 1), I have no factor levels that are unrepresented in the data, and unlike 2), the factor levels in my training and test data are identical.
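
Both claims can be checked in a few lines of base R. The following is only a sketch: `has_unused_levels` and `same_levels` are hypothetical helper names introduced here for illustration, and the small factor `f` is made-up data rather than part of `train.rf5`.

```r
# Illustrative factor with one declared level ("c") that never occurs.
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# Levels that are declared but not represented in the data (point 1).
unused <- setdiff(levels(f), as.character(unique(f)))

# Hypothetical helper: for each column, TRUE if it is a factor
# with at least one level that has a count of zero.
has_unused_levels <- function(df) {
  vapply(df, function(x) is.factor(x) && any(table(x) == 0L), logical(1))
}

# Hypothetical helper: TRUE if both data frames have the same columns
# with the same factor levels in the same order (point 2).
same_levels <- function(train, test) {
  identical(lapply(train, levels), lapply(test, levels))
}
```

For the data above, one would expect `has_unused_levels(train.rf5)` to be `FALSE` for every column and `same_levels(train.rf5, train.rf5)` to be trivially `TRUE`, which suggests the error has a different cause.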

Edit: additional information:

    sessionInfo()
    R version 3.0.1 (2013-05-16)
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=C                 LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] randomForest_4.6-7
    
    loaded via a namespace (and not attached):
    [1] tools_3.0.1
    

1 Answer:

Answer 0 (score: 5)

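I tested my speculation that the ordered factors were the source of the problem, and I get no error when the only thing I do is remove "ordered" from the class of those structures. I do not see in the documentation that ordered factors are disallowed, but I also don't see them specifically contemplated. It is possible that this has not come up before. It seems that ordering would bring in extra complexity, and if you want the order to be taken into account you could offer as.numeric(.) "scores" to the RF algorithm.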
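
The two workarounds described above could look roughly like this. It is a minimal sketch using base R only: the vector `b` is a shortened stand-in for column B of `train.rf5`, and `drop_ordered` is a hypothetical helper name, not a function from the randomForest package.

```r
# Small ordered factor standing in for column B of train.rf5
# (illustrative data, not the asker's full data frame).
b <- factor(c("4", "1", "1", "2", "5"),
            levels = c("1", "2", "4", "5"), ordered = TRUE)

# Option 1: keep the same levels but drop the "ordered" class.
b_unordered <- factor(b, levels = levels(b), ordered = FALSE)

# Option 2: replace the factor with numeric "scores".
# Note that as.numeric() on a factor returns the level positions
# (1, 2, 3, ...), not the numbers written in the labels.
b_score <- as.numeric(b)

# Hypothetical helper: apply option 1 to every ordered column
# of a data frame and leave all other columns untouched.
drop_ordered <- function(df) {
  for (nm in names(df)) {
    if (is.ordered(df[[nm]])) {
      df[[nm]] <- factor(df[[nm]], levels = levels(df[[nm]]), ordered = FALSE)
    }
  }
  df
}
```

With the data from the question, the model would then be fitted and used on the converted data frame, for example `fixed <- drop_ordered(train.rf5)` followed by `randomForest(C ~ ., data = fixed)` and `predict(model.rf5, fixed)`. Option 1 discards the ordering information entirely, whereas option 2 keeps it by treating the variable as numeric.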