If I fit a randomForest(y ~ x, data = df) model where x is a factor variable with more than 53 levels, I get:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
If I change x to as.character(x) and rerun, there is no error.
What is the difference behind the scenes? Aren't both types treated as categorical variables?
Answer 0 (score: 4)
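My guess is that each category name in your data is a numeric value, because randomForest() cannot handle a character column whose values are non-numeric strings. randomForest() treats a character column as a numeric variable (numeric class), not as a categorical variable (factor class). So if you change the names of the categories, the result changes.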
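The distinction can be seen in base R without fitting a model. The sketch below assumes, as this answer guesses, that a character column ends up coerced by something equivalent to as.numeric(); the exact coercion inside randomForest() may differ between package and R versions, so treat this as an illustration of the idea rather than a description of the internals.

```r
# Predictor values whose "names" are numbers, as in the premise above.
x <- c(10, 3, 7)

f  <- factor(x)        # categorical: levels "3", "7", "10"
ch <- as.character(x)  # character strings "10", "3", "7"

levels(f)              # the distinct categories
as.integer(f)          # internal level codes, not the original values

# Assumption: a numeric-looking character vector is coerced like this,
# which recovers the values, so they act as an ordinary numeric predictor.
as.numeric(ch)

# Non-numeric strings cannot be coerced and become NA (with a warning).
suppressWarnings(as.numeric(c("a", "b")))
```

Renaming the levels of f changes only the labels, whereas changing the strings in ch changes the numbers that result from the coercion.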
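Here is my example. When x_ is a factor, the result is the same regardless of how the levels are named. When x_ is an integer, or a character vector composed of numeric values, the output depends on the values. What as.character(x) gives you is therefore a wrong result!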
library(randomForest)

# Weights at Time == 18 as the response; x1 is a random permutation of 1..47
set.seed(1)
cw <- data.frame(y = subset(ChickWeight, Time == 18)$weight, x1 = sample(47))
cw$x2 <- as.factor(cw$x1)            # same values as x1, stored as a factor
cw$x3 <- as.character(cw$x1)         # same values as x1, stored as character
cw$x4 <- 47:1                        # a different assignment of numbers
cw$x5 <- as.factor(47:1)
cw$x6 <- as.character(47:1)
cw$x7 <- c(letters, LETTERS[1:21])   # 47 non-numeric labels
cw$x8 <- as.factor(cw$x7)

# Same seed for every fit so that only the predictor's class differs.
#                                        % Var explained   class(x_)
set.seed(1); randomForest(y ~ x1, cw)  # -29.61            integer (values 1)
set.seed(1); randomForest(y ~ x2, cw)  # -0.42             factor
set.seed(1); randomForest(y ~ x3, cw)  # -29.61            character (numeric names 1)
set.seed(1); randomForest(y ~ x4, cw)  # -31.78            integer (values 2)
set.seed(1); randomForest(y ~ x5, cw)  # -0.42             factor
set.seed(1); randomForest(y ~ x6, cw)  # -31.78            character (numeric names 2)
set.seed(1); randomForest(y ~ x7, cw)  # error             character (letter names)
set.seed(1); randomForest(y ~ x8, cw)  # -0.42             factor
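Before fitting, it can help to check how each predictor is stored, since by the reasoning above only factor columns are treated as categorical. The following is a minimal base-R sketch: the helper name describe_predictors and the toy data frame are made up for illustration, and the limit of 53 is simply taken from the error message in the question.

```r
# Hypothetical helper: report the class and number of levels of each column,
# and flag factors that exceed the category limit from the error message.
describe_predictors <- function(df, max_levels = 53L) {
  n_levels <- unname(vapply(
    df,
    function(col) if (is.factor(col)) nlevels(col) else NA_integer_,
    integer(1)
  ))
  data.frame(
    column   = names(df),
    class    = unname(vapply(df, function(col) class(col)[1], character(1))),
    n_levels = n_levels,
    too_many = !is.na(n_levels) & n_levels > max_levels,
    stringsAsFactors = FALSE
  )
}

# Toy data: the same 60 distinct values stored three different ways.
toy <- data.frame(id = 1:60)
toy$as_factor <- factor(toy$id)
toy$as_char   <- as.character(toy$id)

describe_predictors(toy)
```

In this toy data only the factor column is flagged. The integer and character columns carry the same 60 distinct values but have no levels to count, which matches the behaviour in the question, where converting x with as.character() made the error disappear.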