Question

I'm confused by the following:

set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=T)), x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
library(rpart)
fit.default = rpart(outcome ~ x + y + z, data=df, method='class')
fit.specified = rpart(outcome ~ x + y + z, data=df, method='class', parms=list(split='gini', loss=matrix(c(0,1,1,1,0,1,1,1,0), nrow=3,ncol=3,byrow=T)))
fit.default$cptable
fit.specified$cptable

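This produces different values in the xerror and xstd columns for the specified fit versus the default fit. But according to ?rpart, the default split is 'gini' and the default loss matrix is a matrix of 1s with a zero diagonal, which is exactly what I supplied. So why do the two behave differently? I noticed this because I was selecting a different tree based on the minimum error and wanted to verify the baseline default case.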

Answer 1

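To illustrate my comment above, if you run the two fits completely separately, regenerating the data from the same seed before each one: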

set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=T)), 
                x=rnorm(1000), 
                y=rnorm(1000), 
                z=rnorm(1000))
library(rpart)
fit.default = rpart(outcome ~ x + y + z, 
                    data=df, 
                    method='class')
fit.default$cptable  

set.seed(144)
df = data.frame(outcome=as.factor(sample(c('a','b','c'), 1000, replace=T)), 
                x=rnorm(1000), 
                y=rnorm(1000), 
                z=rnorm(1000))
library(rpart)
fit.specified = rpart(outcome ~ x + y + z, 
                      data=df, 
                      method='class', 
                      parms=list(split='gini', 
                                loss=matrix(c(0,1,1,1,0,1,1,1,0), 
                                nrow=3,
                                ncol=3,
                                byrow=T)))

fit.specified$cptable

you get:

> fit.default$cptable
         CP nsplit rel error    xerror       xstd
1 0.0375000      0  1.000000 1.0000000 0.02371708
2 0.0140625      1  0.962500 0.9640625 0.02401939
3 0.0100000      3  0.934375 0.9921875 0.02378775

and

> fit.specified$cptable
         CP nsplit rel error    xerror       xstd
1 0.0375000      0  1.000000 1.0000000 0.02371708
2 0.0140625      1  0.962500 0.9640625 0.02401939
3 0.0100000      3  0.934375 0.9921875 0.02378775
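The identical tables suggest that the difference in the question comes from the random-number state rather than from parms: rpart assigns observations to cross-validation folds at random, so two fits run back to back after a single set.seed() call draw different folds. A minimal sketch of a check, assuming the same simulated data as above; the name loss01 and the seed value 1 are illustrative choices, not from the original post:

```r
library(rpart)

# Same simulated data as in the question
set.seed(144)
df <- data.frame(outcome = as.factor(sample(c('a', 'b', 'c'), 1000, replace = TRUE)),
                 x = rnorm(1000),
                 y = rnorm(1000),
                 z = rnorm(1000))

# Explicit 0/1 loss matrix (zero diagonal), matching the documented default
loss01 <- matrix(c(0, 1, 1,
                   1, 0, 1,
                   1, 1, 0), nrow = 3, ncol = 3, byrow = TRUE)

# Reset the RNG immediately before each fit so both draw the same CV folds.
# The seed value itself is arbitrary (an assumption for illustration).
set.seed(1)
fit.default <- rpart(outcome ~ x + y + z, data = df, method = 'class')

set.seed(1)
fit.specified <- rpart(outcome ~ x + y + z, data = df, method = 'class',
                       parms = list(split = 'gini', loss = loss01))

# Compare the complexity tables, including the xerror and xstd columns
same_cv <- isTRUE(all.equal(fit.default$cptable, fit.specified$cptable))
print(same_cv)
```

If the fold assignment is the only source of the discrepancy, same_cv should come out TRUE; removing the second set.seed(1) call would be expected to bring back differing xerror and xstd values while CP, nsplit and rel error stay the same.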

Why do I get different cross-validation errors with rpart if I specify parms with the default values?
