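I wrote a short piece of R code to check how the splitting criteria work. I got an unexpected result: all of them choose the same value to split on. Can someone explain? Here is the code: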
set.seed(1)
y <- sample(c(1, 0), 10000, replace = TRUE)
x <- seq(1, 10000)
data <- data.frame(x, y)
library(rpart)
rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
Answer 0 (score: 0)
In my case (with 1,000 observations instead of 10,000), only the last rpart command produced a split:
> set.seed(1)
> y <- sample(c(1, 0), 1000, replace = T)
> x <- seq(1, 1000)
> data <- data.frame(x, y)
> library(rpart)
No split with split="gini":
> rpart(y~x,data = data,parms=list(split="gini"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1000 480 1 (0.4800000 0.5200000) *
No split with split="information":
> rpart(y~x,data = data,parms=list(split="information"),method = "class",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 1000 480 1 (0.4800000 0.5200000) *
With method="anova", one split is made:
> rpart(y~x,data = data,method = "anova",control = list(maxdepth = 1,cp=0.0001,minsplit=1))
n= 1000
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 249.6000 0.5200000
2) x< 841.5 841 210.1831 0.5089180 *
3) x>=841.5 159 38.7673 0.5786164 *
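As for why the split points can end up in the same place, see the rpart documentation. In short, for a two-class problem the different measures can produce similar split points.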
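To illustrate that last point, here is a minimal sketch in base R that scans every candidate cut on x and picks the one with the largest impurity decrease under three measures. It is not rpart's actual implementation: best_split, gini, and entropy are hypothetical helper names introduced here, the impurity formulas are the standard two-class definitions (Gini as 2p(1-p), entropy as -p log p - (1-p) log(1-p)), and rpart's priors, minbucket, and cp handling are ignored. For a 0/1 response the within-node variance is p(1-p), which is exactly half the Gini value, so the variance-based and Gini-based scans should pick the same cut. The entropy scan usually lands on the same or a nearby cut, but that is not guaranteed.

```r
# Two-class impurity functions; p is the proportion of class 1 in a node.
# (Hypothetical helpers for illustration, not part of rpart.)
gini <- function(p) 2 * p * (1 - p)
entropy <- function(p) {
  # Define 0 * log(0) as 0 so that pure nodes get impurity 0.
  ifelse(p == 0 | p == 1, 0, -p * log(p) - (1 - p) * log(1 - p))
}

# Scan all cuts of y (assumed ordered by x = 1..n) and return the cut point
# with the largest weighted decrease in impurity.
best_split <- function(y, impurity) {
  n <- length(y)
  k <- seq_len(n - 1)                    # size of the left child for each cut
  left_ones <- cumsum(y)[k]              # class-1 count in the left child
  p_left <- left_ones / k
  p_right <- (sum(y) - left_ones) / (n - k)
  gain <- impurity(mean(y)) -
    (k / n) * impurity(p_left) -
    ((n - k) / n) * impurity(p_right)
  k[which.max(gain)] + 0.5               # midpoint between x = k and x = k + 1
}

set.seed(1)
y <- sample(c(1, 0), 1000, replace = TRUE)

split_gini    <- best_split(y, gini)
split_entropy <- best_split(y, entropy)
split_sse     <- best_split(y, function(p) p * (1 - p))  # variance of a 0/1 response

print(c(gini = split_gini, information = split_entropy, anova = split_sse))
```

The exact cut points depend on the R version, because the default sampling algorithm behind sample() changed in R 3.6.0, so no expected output is shown here.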