Running randomForest

Time: 2016-06-15 05:49:35

Tags: r random-forest

If I run a randomForest(y ~ x, data = df) model where x is a factor variable with more than 53 levels, I get:

Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.

If I change x to as.character(x) and re-run, there is no error.

What is the difference behind the scenes? Aren't both types treated as categorical variables?

1 answer:

Answer 0: (score: 4)

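My guess is that each of your category names is a numeric value, because randomForest() cannot handle a character column whose values are made up of letters. randomForest() treats a character column as a numeric variable (i.e., numeric class), not as a categorical variable (i.e., factor class). If you change the name of each category, the result will change.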
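To make the distinction concrete, here is a minimal base-R sketch. It does not call randomForest(), and it only assumes (I have not checked the package internals) that digit-only strings end up being read by their numeric value, much as as.numeric() would read them. The names x_chr, x_num, x_fac, and x_bad are made up for illustration.

```r
# Base R only; no randomForest call is made here.
x_chr <- c("10", "2", "33")

# Read as numbers, the strings keep their magnitudes and ordering,
# so relabeling the categories changes the data the model sees.
x_num <- as.numeric(x_chr)
print(x_num)

# As a factor, the same values are unordered labels backed by integer codes.
x_fac <- factor(x_chr)
print(levels(x_fac))
print(as.integer(x_fac))

# Letter strings have no numeric reading and become NA.
x_bad <- suppressWarnings(as.numeric(c("a", "b")))
print(x_bad)
```

If the assumption holds, this would be consistent with digit-only character columns fitting without complaint while letter-only ones fail.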

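Here is my example. If x_ is of factor class, the same result is returned no matter how the levels are named. If x_ is of integer class, or of character class but composed of numeric values, the output depends on the values. The result you got with as.character(x) is a wrong result!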

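library(randomForest)  # provides randomForest()

# Chick weights at Time == 18 (47 rows), plus a random permutation of 1..47
set.seed(1); cw <- data.frame(y = subset(ChickWeight, Time==18)$weight, x1 = sample(47) )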
cw$x2 <- as.factor(cw$x1)
cw$x3 <- as.character(cw$x1)
cw$x4 <- 47:1
cw$x5 <- as.factor(47:1)
cw$x6 <- as.character(47:1)
cw$x7 <- c(letters, LETTERS[1:21])
cw$x8 <- as.factor(cw$x7)
                               # %Var explained # class(x_)
set.seed(1); randomForest(y ~ x1, cw) # -29.61  integer1
set.seed(1); randomForest(y ~ x2, cw) # -0.42   factor
set.seed(1); randomForest(y ~ x3, cw) # -29.61  character (numeric name1)
set.seed(1); randomForest(y ~ x4, cw) # -31.78  integer2
set.seed(1); randomForest(y ~ x5, cw) # -0.42   factor
set.seed(1); randomForest(y ~ x6, cw) # -31.78  character (numeric name2)
set.seed(1); randomForest(y ~ x7, cw) # error   character (letter name)
set.seed(1); randomForest(y ~ x8, cw) # -0.42   factor
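Since the class of a predictor changes how it is handled, it may help to inspect the columns before fitting. The following is a rough sketch in base R; describe_predictors is a hypothetical helper (not part of randomForest), and the default limit of 53 is simply taken from the error message in the question.

```r
# Hypothetical helper: summarize each column's class and flag factors
# with more levels than max_levels (53 comes from the error message above).
describe_predictors <- function(df, max_levels = 53L) {
  data.frame(
    column = names(df),
    class = vapply(df, function(col) class(col)[1], character(1),
                   USE.NAMES = FALSE),
    n_distinct = vapply(df, function(col) length(unique(col)), integer(1),
                        USE.NAMES = FALSE),
    too_many_levels = vapply(
      df,
      function(col) is.factor(col) && nlevels(col) > max_levels,
      logical(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Made-up demo data: a 60-level factor, digit-only strings, and a numeric column.
demo <- data.frame(
  id = factor(1:60),
  code = as.character(1:60),
  value = (1:60) / 10,
  stringsAsFactors = FALSE
)

report <- describe_predictors(demo)
print(report)
```

A column reported as character is worth converting to a factor explicitly if it is meant to be categorical, rather than relying on whatever coercion the fitting function applies.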