Question

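I assumed that the number of splits to try would have a huge impact on performance, so I decided to split the information contained in one unordered factor with 92 levels across several variables, some of them with ordered levels.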

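After testing, however, the opposite turned out to be true. I can't find an explanation myself, and I figure someone here may know more about why this happens.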

I put together a quick example of the behaviour in the code below.

library(rpart)

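# One million rows: four binary predictors, one many-level predictor, and a Gaussian response
test = data.frame(a=sample(0:1,1000000,replace=TRUE),b=sample(0:1,1000000,replace=TRUE),
                  c=sample(0:1,1000000,replace=TRUE),d=sample(0:1,1000000,replace=TRUE),
                  e=sample(0:90,1000000,replace=TRUE),f=rnorm(1000000,0,1))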
test$a <- as.factor(test$a)
test$b <- as.factor(test$b)
test$c <- as.factor(test$c)
test$d <- as.factor(test$d)
test$e <- as.factor(test$e)

# User CPU time for each fit, with surrogate splits disabled (maxsurrogate = 0)
as.numeric(system.time(temp <- rpart(f~a+b+c+d,data= test , method="anova",cp=0.0,maxsurrogate=0))[1])
as.numeric(system.time(temp <- rpart(f~e,data= test , method="anova",cp=0.0,maxsurrogate=0))[1])


Rprof("test1.Rprof")
rpart(f~a+b+c+d,data= test , method="anova",cp=0.0)
Rprof()
Rprof("test2.Rprof")
rpart(f~e,data= test , method="anova",cp=0.0)
Rprof()

summaryRprof("test1.Rprof")
summaryRprof("test2.Rprof")
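
To make "number of splits to try" concrete, here is a rough sketch that counts candidate binary splits at the root for the two formulas. The helper names `n_splits_exhaustive` and `n_splits_ordered` are made up for illustration. The ordered count rests on an assumption: as far as I understand, rpart's anova method sorts the levels of a factor by mean response and then only tries the cut points along that ordering, rather than every partition of the levels.

```r
# Hypothetical helpers (not part of rpart): candidate binary splits
# for a single factor with k levels.
n_splits_exhaustive <- function(k) 2^(k - 1) - 1  # every two-group partition of the levels
n_splits_ordered    <- function(k) k - 1          # cut points along an ordered sequence of levels

k_small <- 2   # levels in each of a, b, c, d
k_big   <- 91  # levels in e: sample(0:90, ...) yields at most 91 distinct values

# Four binary factors: one possible split each.
splits_abcd <- 4 * n_splits_ordered(k_small)

# One 91-level factor: assumed ordered-by-mean search versus an exhaustive search.
splits_e_ordered    <- n_splits_ordered(k_big)
splits_e_exhaustive <- n_splits_exhaustive(k_big)

print(c(abcd = splits_abcd,
        e_ordered = splits_e_ordered,
        e_exhaustive = splits_e_exhaustive))
```

If that assumption holds, the gap in candidate splits between the two models is 4 versus 90 at the root, not 4 versus an astronomically large number, so split counting alone may not predict the timings.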

Thanks for your time.

Rpart speed: the number of variables matters more than the number of splits to try

0 answers: