I'm confused by the following:
set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=T)), x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
library(rpart)
fit.default = rpart(outcome ~ x + y + z, data=df, method='class')
fit.specified = rpart(outcome ~ x + y + z, data=df, method='class', parms=list(split='gini', loss=matrix(c(0,1,1,1,0,1,1,1,0), nrow=3,ncol=3,byrow=T)))
fit.default$cptable
fit.specified$cptable
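The two fits produce different values in the xerror and xstd columns. But according to ?rpart, the default split is 'gini' and the default loss matrix is the matrix of 1s with a zero diagonal, which is exactly what I supplied. So why do they behave differently? I noticed this because I ended up selecting a different tree based on the minimum error, and I wanted to verify the baseline default case.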
Answer 0 (score: 3)
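To illustrate my comment above: rpart assigns its cross-validation folds at random, so xerror and xstd depend on the state of the random number generator when the fit runs. If you fully separate the two fits and reset the seed before each one: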
set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=T)),
                x=rnorm(1000),
                y=rnorm(1000),
                z=rnorm(1000))
library(rpart)
fit.default = rpart(outcome ~ x + y + z,
                    data=df,
                    method='class')
fit.default$cptable
set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=T)),
                x=rnorm(1000),
                y=rnorm(1000),
                z=rnorm(1000))
library(rpart)
fit.specified = rpart(outcome ~ x + y + z,
                      data=df,
                      method='class',
                      parms=list(split='gini',
                                 loss=matrix(c(0,1,1,1,0,1,1,1,0),
                                             nrow=3,
                                             ncol=3,
                                             byrow=T)))
fit.specified$cptable
you get:
> fit.default$cptable
         CP nsplit rel error    xerror       xstd
1 0.0375000      0  1.000000 1.0000000 0.02371708
2 0.0140625      1  0.962500 0.9640625 0.02401939
3 0.0100000      3  0.934375 0.9921875 0.02378775
and
> fit.specified$cptable
         CP nsplit rel error    xerror       xstd
1 0.0375000      0  1.000000 1.0000000 0.02371708
2 0.0140625      1  0.962500 0.9640625 0.02401939
3 0.0100000      3  0.934375 0.9921875 0.02378775
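
The same check can be written more compactly by building the data once and resetting the seed immediately before each call to rpart. This is only a sketch: the names cv.seed, unit.loss, and same.cv are made up for illustration, and the seed value 1 is arbitrary. It assumes, as the output above suggests, that the only source of disagreement is the random assignment of cross-validation folds.

```r
library(rpart)

# Build the data once, exactly as in the question
set.seed(144)
df <- data.frame(outcome = as.factor(sample(c('a', 'b', 'c'), 1000, replace = TRUE)),
                 x = rnorm(1000),
                 y = rnorm(1000),
                 z = rnorm(1000))

# Explicit form of the documented default loss: 1s off the diagonal, 0s on it
# (unit.loss is an illustrative name)
unit.loss <- matrix(1, nrow = 3, ncol = 3) - diag(3)

# Arbitrary seed for the cross-validation folds (illustrative value)
cv.seed <- 1

# Reset the RNG right before each fit so both fits draw the same folds
set.seed(cv.seed)
fit.default <- rpart(outcome ~ x + y + z, data = df, method = 'class')

set.seed(cv.seed)
fit.specified <- rpart(outcome ~ x + y + z, data = df, method = 'class',
                       parms = list(split = 'gini', loss = unit.loss))

# Compare the full complexity tables, including xerror and xstd
same.cv <- isTRUE(all.equal(fit.default$cptable, fit.specified$cptable))
print(same.cv)
```

If same.cv is TRUE, choosing the row with the smallest xerror, for example with which.min(fit.default$cptable[, 'xerror']), picks the same tree from either fit. Removing the second set.seed call should reproduce the kind of mismatch described in the question, since the second fit would then draw different folds.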