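I am building a decision tree with rpart, and that part works without problems. However, I need to find out (or compute) how many times the tree was split. In other words, how many rules (if-else statements) does the tree have? For example: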
                 X
               -   -
     if (a<9) -     - if (a>=9)
             Y       H
            -
   if (b>2)-
          Z
This tree has 3 rules.
When I call summary() on the model, I get:
summary(model_dt)
Call:
rpart(formula = Alert ~ ., data = train)
n= 18576811
CP nsplit rel error xerror xstd
1 0.9597394 0 1.00000000 1.00000000 0.0012360956
2 0.0100000 1 0.04026061 0.05290522 0.0002890205
Variable importance
         ip.src frame.protocols   tcp.flags.ack tcp.flags.reset       frame.len
             20              17              17              17              16
         ip.ttl
             12
Node number 1: 18576811 observations, complexity param=0.9597394
predicted class=yes expected loss=0.034032 P(node) =1
class counts: 632206 1.79446e+07
probabilities: 0.034 0.966
left son=2 (627091 obs) right son=3 (17949720 obs)
Primary splits:
ip.src splits as LLLLLLLRRRLLRR ............ LLRLRLRRRRRRRRRRRRRRRR
improve=1170831.0, (0 missing)
ip.dts splits as LLLLLLLLLLLLLLLLLLLRLLLLLLLLLLL, improve=1013082.0, (0 missing)
tcp.flags.ctl < 1.5 to the right, improve=1007953.0, (2645 missing)
tcp.flags.syn < 1.5 to the right, improve=1007953.0, (2645 missing)
frame.len < 68 to the right, improve= 972871.3, (30 missing)
Surrogate splits:
frame.protocols splits as LLLLLLLLLLLLLLLLLLLRLLLLLLLLLLL, agree=0.995, adj=0.841, (0 split)
tcp.flags.ack < 1.5 to the right, agree=0.994, adj=0.836, (0 split)
tcp.flags.reset < 1.5 to the right, agree=0.994, adj=0.836, (0 split)
frame.len < 68 to the right, agree=0.994, adj=0.809, (0 split)
ip.ttl < 230.5 to the right, agree=0.987, adj=0.612, (0 split)
Node number 2: 627091 observations
predicted class=no expected loss=0.01621615 P(node) =0.03375666
class counts: 616922 10169
probabilities: 0.984 0.016
Node number 3: 17949720 observations
predicted class=yes expected loss=0.0008514896 P(node) =0.9662433
class counts: 15284 1.79344e+07
probabilities: 0.001 0.999
I would be grateful if someone could help me understand this.
Regards, ERAY
Answer 0 (score: 4)
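With a little knowledge of how the tree object is returned (see ?rpart.object), there are a couple of ways to do this.
I'll show two of them using the kyphosis data set in R, following the first example in ?rpart: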
> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> tail(fit$cptable[, "nsplit"], 1)
3
4
> unname(tail(fit$cptable[, "nsplit"], 1)) ## or
[1] 4
This value comes from the cptable component, which holds information about the cost-complexity of trees of a given size:
> fit$cptable
CP nsplit rel error xerror xstd
1 0.17647059 0 1.0000000 1.000000 0.2155872
2 0.01960784 1 0.8235294 1.176471 0.2282908
3 0.01000000 4 0.7647059 1.176471 0.2282908
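From memory, the last row of this table refers to the current largest tree. If you prune the tree to a particular size based on CP, the last row of this matrix then holds the information for a tree of that size: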
> fit2 <- prune(fit, cp = 0.02)
> fit2$cptable
CP nsplit rel error xerror xstd
1 0.1764706 0 1.0000000 1.000000 0.2155872
2 0.0200000 1 0.8235294 1.176471 0.2282908
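The cptable lookup above can be wrapped in a small helper so that it works the same way on full and pruned trees. This is only a sketch: `count_splits_cp` is a hypothetical name introduced here for illustration, not a function provided by rpart, and it assumes the last row of cptable describes the current tree, as discussed above.

```r
library(rpart)

# Hypothetical helper (not part of rpart): read the number of splits
# from the last row of the tree's cptable.
count_splits_cp <- function(tree) {
  unname(tail(tree$cptable[, "nsplit"], 1))
}

fit  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- prune(fit, cp = 0.02)

count_splits_cp(fit)   # splits in the full tree
count_splits_cp(fit2)  # splits remaining after pruning
```

Given the tables printed above, these calls should agree with the nsplit values in the last rows of fit$cptable and fit2$cptable.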
The second option is to count the number of occurrences of <leaf> in the var column of the frame component of the fitted model:
> fit$frame
var n wt dev yval complexity ncompete nsurrogate yval2.V1 yval2.V2
1 Start 81 81 17 1 0.17647059 2 1 1.00000000 64.00000000
2 Start 62 62 6 1 0.01960784 2 2 1.00000000 56.00000000
4 <leaf> 29 29 0 1 0.01000000 0 0 1.00000000 29.00000000
5 Age 33 33 6 1 0.01960784 2 2 1.00000000 27.00000000
10 <leaf> 12 12 0 1 0.01000000 0 0 1.00000000 12.00000000
11 Age 21 21 6 1 0.01960784 2 0 1.00000000 15.00000000
22 <leaf> 14 14 2 1 0.01000000 0 0 1.00000000 12.00000000
23 <leaf> 7 7 3 2 0.01000000 0 0 2.00000000 3.00000000
3 <leaf> 19 19 8 2 0.01000000 0 0 2.00000000 8.00000000
yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
1 17.00000000 0.79012346 0.20987654 1.00000000
2 6.00000000 0.90322581 0.09677419 0.76543210
4 0.00000000 1.00000000 0.00000000 0.35802469
5 6.00000000 0.81818182 0.18181818 0.40740741
10 0.00000000 1.00000000 0.00000000 0.14814815
11 6.00000000 0.71428571 0.28571429 0.25925926
22 2.00000000 0.85714286 0.14285714 0.17283951
23 4.00000000 0.42857143 0.57142857 0.08641975
3 11.00000000 0.42105263 0.57894737 0.23456790
That count minus 1 is the number of splits, because every split in an rpart tree is binary. To do the counting, we can use:
> grepl("^<leaf>$", as.character(fit$frame$var))
[1] FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
> sum(grepl("^<leaf>$", as.character(fit$frame$var))) - 1
[1] 4
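The regular expression I used may be overkill, but it checks for a string that starts (^) and ends ($) with "<leaf>", i.e. "<leaf>" is the entire string. I use grepl() to return the matches on the var column as a logical vector, so we can sum the TRUEs and subtract 1 from the result. Because var is stored as a factor, I convert it to a character vector inside the grepl() call.
You could also do this with grep(), which returns the indices of the matches, and count them with length():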
> grep("^<leaf>$", as.character(fit$frame$var))
[1] 3 5 7 8 9
> length(grep("^<leaf>$", as.character(fit$frame$var))) - 1
[1] 4
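As a possible shortcut that skips the regular expression, the rows of frame whose var is not "<leaf>" can be counted directly, since each such row is one split. The sketch below is an alternative to the approach in the answer rather than part of it; the variable names `n_internal` and `n_leaves` are made up for illustration.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Each row of fit$frame is one node of the tree. Rows whose var is not
# "<leaf>" are internal nodes, and each internal node is one split.
n_internal <- sum(fit$frame$var != "<leaf>")

# Cross-check against the leaf count: a binary tree has (leaves - 1) splits.
n_leaves <- sum(fit$frame$var == "<leaf>")

n_internal
n_leaves - 1
```

Both values should match the nsplit figure in the last row of fit$cptable.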