rpart returns only a root node, but the data shows information gain

Time: 2017-10-31 06:39:49

Tags: r machine-learning decision-tree rpart information-gain

I have a dataset with an event rate below 3% (roughly 700 records of class 1 and 27,000 records of class 0).

ID          V1  V2      V3  V5      V6  Target
SDataID3    161 ONE     1   FOUR    0   0
SDataID4    11  TWO     2   THREE   2   1
SDataID5    32  TWO     2   FOUR    2   0
SDataID7    13  ONE     1   THREE   2   0
SDataID8    194 TWO     2   FOUR    0   0
SDataID10   63  THREE   3   FOUR    0   1
SDataID11   89  ONE     1   FOUR    0   0
SDataID13   78  TWO     2   FOUR    0   0
SDataID14   87  TWO     2   THREE   1   0
SDataID15   81  ONE     1   THREE   0   0
SDataID16   63  ONE     3   FOUR    0   0
SDataID17   198 ONE     3   THREE   0   0
SDataID18   9   TWO     3   THREE   0   0
SDataID19   196 ONE     2   THREE   2   0
SDataID20   189 TWO     2   ONE     1   0
SDataID21   116 THREE   3   TWO     0   0
SDataID24   104 ONE     1   FOUR    0   0
SDataID25   5   ONE     2   ONE     3   0
SDataID28   173 TWO     3   FOUR    0   0
SDataID29   5   ONE     3   ONE     3   0
SDataID31   87  ONE     3   FOUR    3   0
SDataID32   5   ONE     2   THREE   1   0
SDataID34   45  ONE     1   FOUR    0   0
SDataID35   19  TWO     2   THREE   0   0
SDataID37   133 TWO     2   FOUR    0   0
SDataID38   8   ONE     1   THREE   0   0
SDataID39   42  ONE     1   THREE   0   0
SDataID43   45  ONE     1   THREE   1   0
SDataID44   45  ONE     1   FOUR    0   0
SDataID45   176 ONE     1   FOUR    0   0
SDataID46   63  ONE     1   THREE   3   0

I am trying to find splits using a decision tree, but the resulting tree has only a root node.

> library(rpart)
> tree <- rpart(Target ~ ., data=subset(train, select=c( -Record.ID) ),method="class")
> printcp(tree)

Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), method = "class")

Variables actually used in tree construction:
character(0)

Root node error: 749/18239 = 0.041066

n= 18239 

  CP nsplit rel error xerror xstd
1  0      0         1      0    0

After reading most of the related posts on StackOverflow, I relaxed the control parameters, which gave me the decision tree I wanted.

> tree <- rpart(Target ~ ., data=subset(train, select=c( -Record.ID) ),method="class" ,control =rpart.control(minsplit = 1,minbucket=2, cp=0.00002))
> printcp(tree)

Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), 
    method = "class", control = rpart.control(minsplit = 1, minbucket = 2, 
        cp = 2e-05))

Variables actually used in tree construction:
[1] V5         V2                     V1          
[4] V3         V6

Root node error: 749/18239 = 0.041066

n= 18239 

          CP nsplit rel error xerror     xstd
1 0.00024275      0   1.00000 1.0000 0.035781
2 0.00019073     20   0.99466 1.0267 0.036235
3 0.00016689     34   0.99199 1.0307 0.036302
4 0.00014835     54   0.98798 1.0334 0.036347
5 0.00002000     63   0.98665 1.0427 0.036504

When I prune the tree, I get a tree with a single node.

> pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
> printcp(pruned.tree)

Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), 
    method = "class", control = rpart.control(minsplit = 1, minbucket = 2, 
        cp = 2e-05))

Variables actually used in tree construction:
character(0)

Root node error: 749/18239 = 0.041066

n= 18239 

          CP nsplit rel error xerror     xstd
1 0.00024275      0         1      1 0.035781

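The tree should not consist of the root node alone, because mathematically we obtain information gain at a given node (example provided below). I don't know whether I am making a mistake in pruning, or whether rpart has a problem with datasets that have a low event rate.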

NODE    p       1-p     Entropy         Weights         Ent*Weight      # Obs
Node 1  0.032   0.968   0.204324671     0.351398601     0.071799404     10653
Node 2  0.05    0.95    0.286396957     0.648601399     0.185757467     19663

Sum(Ent*wght)       0.257556871 
Information gain    0.742443129 

1 answer:

Answer 0 (score: 2)

The data you provided does not reflect the proportion of the two target classes, so I adjusted it to reflect that better (see the Data section):

> prop.table(table(train$Target))

         0          1 
0.96707581 0.03292419 

> 700/27700
[1] 0.02527076

The two proportions are now relatively close...

library(rpart)
tree <- rpart(Target ~ ., data=train, method="class")
printcp(tree)

Result:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class")

Variables actually used in tree construction:
character(0)

Root node error: 912/27700 = 0.032924

n= 27700 

  CP nsplit rel error xerror xstd
1  0      0         1      0    0

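The reason you see only the root node in your first model is probably that your target classes are extremely imbalanced, so your independent variables cannot provide enough information to grow the tree. My sample data has a 3.3% event rate, but yours is only about 2.5%!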

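As you mentioned, there is a way to force rpart to grow the tree: override the default complexity parameter (cp). The complexity measure is a combination of the size of the tree and how well the tree separates the target classes. From ?rpart.control: "Any split that does not decrease the overall lack of fit by a factor of cp is not attempted". This means that, at this point, your model has no split beyond the root node that decreases the lack of fit by enough for rpart to consider it. We can relax this threshold for what counts as "enough" by setting a low or a negative cp (a negative cp basically forces the tree to grow to its full size).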
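To make the "lack of fit" point concrete, here is a minimal sketch using the two nodes from the table in the question. It assumes that, for a classification tree, lack of fit is counted as misclassified records, which is my understanding of how rpart scales cp and not something the output above states. The helper `node_errors` is made up for illustration.

```r
# Hypothetical helper: expected number of misclassified records in a node
# that predicts its majority class
node_errors <- function(n, p_event) n * min(p_event, 1 - p_event)

# Node sizes and event rates copied from the table in the question
n_child <- c(10653, 19663)
p_child <- c(0.032, 0.05)

# Parent node: both children pooled together
n_parent <- sum(n_child)
p_parent <- sum(n_child * p_child) / n_parent

errors_before <- node_errors(n_parent, p_parent)
errors_after  <- sum(mapply(node_errors, n_child, p_child))

# Both children still predict class 0, so the error count does not move
errors_before - errors_after
```

Under that assumption, a split can separate a 3.2% node from a 5% node and still leave the misclassification count unchanged, because neither child flips its predicted class.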

tree <- rpart(Target ~ ., data=train, method="class" ,parms = list(split = 'information'), 
              control =rpart.control(minsplit = 1,minbucket=2, cp=0.00002))
printcp(tree)

Result:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"), 
    control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))

Variables actually used in tree construction:
[1] ID V1 V2 V3 V5 V6

Root node error: 912/27700 = 0.032924

n= 27700 

           CP nsplit rel error xerror     xstd
1  4.1118e-04      0   1.00000 1.0000 0.032564
2  3.6550e-04     30   0.98355 1.0285 0.033009
3  3.2489e-04     45   0.97807 1.0702 0.033647
4  3.1328e-04    106   0.95504 1.0877 0.033911
5  2.7412e-04    116   0.95175 1.1031 0.034141
6  2.5304e-04    132   0.94737 1.1217 0.034417
7  2.1930e-04    149   0.94298 1.1458 0.034771
8  1.9936e-04    159   0.94079 1.1502 0.034835
9  1.8275e-04    181   0.93640 1.1645 0.035041
10 1.6447e-04    193   0.93421 1.1864 0.035356
11 1.5664e-04    233   0.92654 1.1853 0.035341
12 1.3706e-04    320   0.91228 1.2083 0.035668
13 1.2183e-04    344   0.90899 1.2127 0.035730
14 9.9681e-05    353   0.90789 1.2237 0.035885
15 2.0000e-05    364   0.90680 1.2259 0.035915

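As you can see, the tree has grown to a size that reduces the level of complexity by at least cp. Two things to note:

  1. At zero nsplit, CP is already as low as 0.0004, whereas the default cp in rpart is 0.01.
  2. Starting from nsplit == 0, the cross-validation error (xerror) increases as you increase the number of splits.

Both of these indicate that your model is overfitting the data beyond nsplit == 0, because adding more independent variables to the model does not add enough information (insufficient reduction in CP) to reduce the cross-validation error. That said, the root node model is the best model in this case, which explains why your initial model has only the root node.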
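The xerror column is relative to the root node, so it can help to rescale it into an absolute error rate. A small sketch, with the first few rows retyped by hand from the printcp() output above (the data frame `cp_rows` is just for illustration):

```r
# Root node error and the first rows of the CP table, copied from the output above
root_error <- 912 / 27700
cp_rows <- data.frame(
  nsplit = c(0, 30, 45, 106),
  xerror = c(1.0000, 1.0285, 1.0702, 1.0877)
)

# xerror is scaled so that the root node equals 1; multiply by the
# root node error to get an absolute cross-validated error rate
cp_rows$abs_xerror <- cp_rows$xerror * root_error
cp_rows
```

Every row past nsplit == 0 ends up with a cross-validated error rate above the roughly 3.3% you get from always predicting class 0.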

    pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
    printcp(pruned.tree)
    

Result:

    Classification tree:
    rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"), 
        control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))
    
    Variables actually used in tree construction:
    character(0)
    
    Root node error: 912/27700 = 0.032924
    
    n= 27700 
    
              CP nsplit rel error xerror     xstd
    1 0.00041118      0         1      1 0.032564
    

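As for the pruning part, it is now clearer why your pruned tree is the root node tree: every tree with more than 0 splits has a higher cross-validation error. Taking the tree with the minimum xerror leaves you with the root node tree, as expected.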
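The selection step inside the prune() call can be traced by hand. This sketch rebuilds only the first three rows of tree$cptable from the printed output above, so it runs without the fitted model:

```r
# First rows of tree$cptable, retyped from the printcp() output above
cptable <- cbind(
  CP     = c(4.1118e-04, 3.6550e-04, 3.2489e-04),
  nsplit = c(0, 30, 45),
  xerror = c(1.0000, 1.0285, 1.0702)
)

# Row with the smallest cross-validated error
best_row <- which.min(cptable[, "xerror"])

cptable[best_row, "nsplit"]  # 0 splits: the root-only tree
cptable[best_row, "CP"]      # the cp value handed to prune()
```

Since xerror is lowest in the first row, the cp passed to prune() is the one that belongs to the tree with zero splits.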

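Information gain basically tells you how much "information" each split adds. Technically, every split has some degree of information gain, because you are adding more variables to the model (information gain is always non-negative). What you should consider is whether that additional gain (or lack of gain) reduces the error enough to warrant a more complex model. Hence the tradeoff between bias and variance.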
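For reference, here is a sketch of the gain for the two nodes in the question's table, assuming the usual definition: entropy of the parent node minus the weighted entropy of the children. The 0.742 figure in the question appears to subtract the weighted entropy from 1 instead of from the parent's entropy; that is my reading of the table, not something the question states.

```r
# Shannon entropy (in bits) of a binary node with event rate p
entropy <- function(p) {
  q <- c(p, 1 - p)
  q <- q[q > 0]            # treat 0 * log2(0) as 0
  -sum(q * log2(q))
}

# Node sizes and event rates copied from the table in the question
n_child <- c(10653, 19663)
p_child <- c(0.032, 0.05)
weights <- n_child / sum(n_child)

# Parent node event rate is the weighted average of the child rates
p_parent <- sum(weights * p_child)
weighted_child_entropy <- sum(weights * sapply(p_child, entropy))

info_gain <- entropy(p_parent) - weighted_child_entropy
info_gain
```

With these inputs the gain is positive but tiny (on the order of 0.001 bits), which is consistent with a split that adds information without reducing the error.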

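In this case, it does not make sense to lower cp and then prune the resulting tree afterwards. By setting a low cp you are telling rpart to split even when that overfits, while pruning "cuts" all the overfitted nodes back off.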
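The grow-then-prune round trip can be tried on data where the answer is known in advance. The sketch below uses simulated data of my own (noise predictors, about 3% events), not your dataset, so the exact tree sizes it prints are only illustrative.

```r
library(rpart)

# Simulated data, purely for illustration: predictors carry no signal
set.seed(42)
n <- 3000
sim <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("ONE", "TWO", "THREE"), n, replace = TRUE)),
  Target = factor(rbinom(n, 1, 0.03))
)

# Grow with the relaxed controls used above, then prune at the minimum xerror
grown <- rpart(Target ~ ., data = sim, method = "class",
               control = rpart.control(minsplit = 1, minbucket = 2, cp = 0.00002))
best_cp <- grown$cptable[which.min(grown$cptable[, "xerror"]), "CP"]
pruned <- prune(grown, cp = best_cp)

# Compare tree sizes (one row of $frame per node)
c(grown = nrow(grown$frame), pruned = nrow(pruned$frame))
```

Whatever the low cp lets the tree grow, pruning at the minimum xerror can only keep the same tree or shrink it.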

Data:

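Note that I am shuffling and sampling the values of each column separately, instead of sampling row indices. This is because the data you provided is probably not a random sample of your original dataset (it is likely biased), so I am basically creating new observations at random from combinations of the existing rows, which will hopefully reduce that bias.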

    init_train = structure(list(ID = structure(c(16L, 24L, 29L, 30L, 31L, 1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
    17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L), .Label = c("SDataID10", 
    "SDataID11", "SDataID13", "SDataID14", "SDataID15", "SDataID16", 
    "SDataID17", "SDataID18", "SDataID19", "SDataID20", "SDataID21", 
    "SDataID24", "SDataID25", "SDataID28", "SDataID29", "SDataID3", 
    "SDataID31", "SDataID32", "SDataID34", "SDataID35", "SDataID37", 
    "SDataID38", "SDataID39", "SDataID4", "SDataID43", "SDataID44", 
    "SDataID45", "SDataID46", "SDataID5", "SDataID7", "SDataID8"), class = "factor"), 
        V1 = c(161L, 11L, 32L, 13L, 194L, 63L, 89L, 78L, 87L, 81L, 
        63L, 198L, 9L, 196L, 189L, 116L, 104L, 5L, 173L, 5L, 87L, 
        5L, 45L, 19L, 133L, 8L, 42L, 45L, 45L, 176L, 63L), V2 = structure(c(1L, 
        3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 2L, 
        1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L
        ), .Label = c("ONE", "THREE", "TWO"), class = "factor"), 
        V3 = c(1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 
        2L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 
        1L, 1L, 1L), V5 = structure(c(1L, 3L, 1L, 3L, 1L, 1L, 1L, 
        1L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 1L, 2L, 1L, 2L, 1L, 3L, 
        1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L), .Label = c("FOUR", "ONE", 
        "THREE", "TWO"), class = "factor"), V6 = c(0L, 2L, 2L, 2L, 
        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 3L, 0L, 
        3L, 3L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 3L), Target = c(0L, 
        1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
        0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
        )), .Names = c("ID", "V1", "V2", "V3", "V5", "V6", "Target"
    ), class = "data.frame", row.names = c(NA, -31L))
    
    # Resample every column independently (with replacement) up to 27,700 rows
    set.seed(1000)
    train <- as.data.frame(lapply(init_train, function(x) sample(x, 27700, replace = TRUE)))
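One consequence of sampling each column on its own is that values from different columns no longer travel together. A sketch on a hypothetical toy data frame (`toy` is not part of the data above) shows the difference from sampling row indices:

```r
# Toy data where y is a perfect linear function of x
toy <- data.frame(x = 1:10, y = (1:10) * 2)

set.seed(1)

# Sampling row indices keeps each (x, y) pair intact
by_row <- toy[sample(nrow(toy), 1000, replace = TRUE), ]

# Sampling each column separately recombines x and y values at random
by_col <- as.data.frame(lapply(toy, function(col) sample(col, 1000, replace = TRUE)))

cor(by_row$x, by_row$y)  # pairs preserved, so the correlation stays perfect
cor(by_col$x, by_col$y)  # pairs broken, so the correlation is close to zero
```

So the train data built this way keeps each column's marginal distribution but not the relationships between the predictors and Target.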