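I am trying to implement RPART myself so that I can build on it later. So far it only handles regression (ANOVA) models. Everything looks clean except for one thing: how RPART chooses the best split among several predictors that give the same improvement.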
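For example, for the initial split I have three predictors that give identical results (same improvement, same split, perfect surrogates for each other): X310, X312, and X317. By default RPART selects X312, even though it is not the first of these predictors in column order. If I permute the columns, RPART selects either X312 or X317, but never X310.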
Here is the summary output when X312 is selected:
Node number 1: 100 observations, complexity param=0.7123717
mean=0.5155042, MSE=0.08350028
left son=2 (47 obs) right son=3 (53 obs)
Primary splits:
X312 < 0.03673 to the left, improve=0.7123717, (0 missing)
X317 < 0.0187715 to the left, improve=0.7123717, (0 missing)
X310 < 0.0440585 to the left, improve=0.7123717, (0 missing)
X318 < 0.0167545 to the left, improve=0.7123435, (0 missing)
X323 < 0.0101715 to the left, improve=0.7092180, (0 missing)
And when X317 is selected:
Node number 1: 100 observations, complexity param=0.7123717
mean=0.5155042, MSE=0.08350028
left son=2 (47 obs) right son=3 (53 obs)
Primary splits:
X317 < 0.0187715 to the left, improve=0.7123717, (0 missing)
X312 < 0.03673 to the left, improve=0.7123717, (0 missing)
X310 < 0.0440585 to the left, improve=0.7123717, (0 missing)
X318 < 0.0167545 to the left, improve=0.7123435, (0 missing)
X323 < 0.0101715 to the left, improve=0.7092180, (0 missing)
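Again, everything is identical. I tried reading RPART's C code but could not find any additional check that would explain this. Any ideas would be much appreciated.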
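
One possibility that would be consistent with this pattern, though it is only an assumption and not something confirmed from the rpart source: the improvements printed in the summary are rounded to 7 digits, so values that look tied may differ in their trailing bits, and a scan that updates the best split only on a strict `>` would keep the first of any exact ties. The sketch below is a toy illustration of that idea in Python. The function `pick_split`, the constants `A` and `LOWER`, and the tiny offset given to X310 are all hypothetical and were chosen only to reproduce the behavior described above.

```python
def pick_split(candidates, strict=True):
    """Return the name of the winning candidate split.

    candidates: list of (name, improve) pairs in column order.
    strict=True  -> update only on '>',  so the first exact tie wins.
    strict=False -> update on '>=',      so the last exact tie wins.
    """
    best_name, best = None, float("-inf")
    for name, improve in candidates:
        if improve > best or (not strict and improve == best):
            best_name, best = name, improve
    return best_name


# Hypothetical values (NOT taken from rpart): both print as 0.7123717
# at 7 decimal places, but LOWER is smaller in the trailing bits.
A = 0.7123717
LOWER = A - 1e-15

order_1 = [("X310", LOWER), ("X312", A), ("X317", A)]
order_2 = [("X317", A), ("X310", LOWER), ("X312", A)]

print(format(A, ".7f"), format(LOWER, ".7f"))  # both show 0.7123717
print(pick_split(order_1))  # X312
print(pick_split(order_2))  # X317
```

Under these assumed values, X310 never wins regardless of column order, while X312 and X317 swap depending on which comes first. Whether this is what actually happens could be checked by printing the improvement values at full precision instead of relying on the rounded summary.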