Why do different random forest implementations in R produce different results?

Time: 2018-09-10 18:47:13

Tags: r machine-learning random-forest r-caret

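I recognize this is a hard question for anyone to answer other than the people who wrote these packages, but I am getting results that are persistently different across three different implementations of random forest in R.

The three methods in question are the randomForest package, the "rf" method in caret, and the ranger package. The code is included below.

The data in question is just one example; I see similar behavior in other specifications of similar data.

The LHS variable is party identification (Dem, Rep, Indep.). The right-hand-side predictors are demographics. To try to figure out what was going on with some bizarre results in the randomForest package, I implemented the same model with the other two methods. I found that they do not reproduce that particular anomaly. This is especially strange because, as far as I know, the rf method in caret is just an indirect use of the randomForest package.

The three specifications I run in each implementation are (1) a three-category classification, (2) the same model with the Independent category dropped, and (3) the same as 2, but with a single observation scrambled to "Independent" so that the model keeps three categories, which should produce results similar to 2. As far as I can tell, in no case should there be any over- or undersampling that would explain the results.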

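I also notice the following trends:

  1. The randomForest package is the only one that breaks down when there are only two categories.
  2. The ranger package consistently identifies more observations (both correctly and incorrectly) as Independents.
  3. The ranger package is always slightly worse in overall predictive accuracy.
  4. The caret package is similar to randomForest in overall accuracy (slightly higher), but consistently better on the more common class and consistently worse on the less common classes. This is odd because, as far as I can tell, I am not over- or undersampling in either case, and I believe caret relies on the randomForest package.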

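Below I include both the code and the confusion matrices that show the differences in question. Re-running the code produces similar trends in the confusion matrices every time; this is not an "any single run can produce weird results" problem.

Does anyone know why these packages generally produce slightly different results (and, in the case of the randomForest problem linked above, very different results) or, even better, why they differ in this particular way? For example, is there some kind of sample weighting or stratification going on inside these packages that I should be aware of?

Code: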

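# Packages used below
library(randomForest)
library(ranger)
library(caret)

num_trees <- 1001  # number of trees per forest
var_split <- 3     # mtry: number of variables tried at each split

load("three_cat.Rda")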
rf_three_cat  <-randomForest(party_id_3_cat~{RHS Vars},
                         data=three_cat,
                         ntree=num_trees,
                         mtry=var_split,
                         type="classification",
                         importance=TRUE,confusion=TRUE)

# Specification 2: drop the Independents and remove the now-empty factor level
two_cat<-subset(three_cat,party_id_3_cat!="2. Independents")
two_cat$party_id_3_cat<-droplevels(two_cat$party_id_3_cat)
rf_two_cat    <-randomForest(party_id_3_cat~{RHS Vars},
                         data=two_cat,
                         ntree=num_trees,
                         mtry=var_split,
                         type="classification",
                         importance=TRUE,confusion=TRUE)
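
# Specification 3: same rows as two_cat, but one observation is relabeled
# "2. Independents" so that the outcome keeps three levels
scramble_independent<-subset(three_cat,party_id_3_cat!="2. Independents")
scramble_independent[1,19]<-"2. Independents"  # column 19 is assumed to be party_id_3_cat
scramble_independent<- data.frame(lapply(scramble_independent, as.factor), stringsAsFactors=TRUE)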
rf_scramble<-randomForest(party_id_3_cat~{RHS Vars},
                      data=scramble_independent,
                      ntree=num_trees,
                      mtry=var_split,
                      type="classification",
                      importance=TRUE,confusion=TRUE)

ranger_2<-ranger(formula=party_id_3_cat~{RHS Vars},
             data=two_cat,
             num.trees=num_trees,mtry=var_split)
ranger_3<-ranger(formula=party_id_3_cat~{RHS Vars},
             data=three_cat,
             num.trees=num_trees,mtry=var_split)
ranger_scram<-ranger(formula=party_id_3_cat~{RHS Vars},
                 data=scramble_independent,
                 num.trees=num_trees,mtry=var_split)

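# caret: method = "none" skips resampling and fits a single model at the one mtry value in the grid
rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = c(3))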
rf_caret_3        <- train(party_id_3_cat~{RHS Vars},
                      data=three_cat,
                      method="rf", ntree=num_trees,
                      type="classification",
                      importance=TRUE,confusion=TRUE,
                      trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2        <- train(party_id_3_cat~{RHS Vars},
                data = two_cat,
                method = "rf",ntree=num_trees,
                type="classification",
                importance=TRUE,confusion=TRUE,
                trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat~{RHS Vars},
                      data = scramble_independent,
                      method = "rf",ntree=num_trees,
                      type="classification",
                      importance=TRUE,confusion=TRUE,
                      trControl = rfControl, tuneGrid = rfGrid)

rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]

rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]

rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]

Results (formatting slightly modified for ease of comparison):

> rf_three_cat$confusion
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1121               3                                697   0.3844042
2. Independents                                                   263               7                                261   0.9868173
3. Republicans (including leaners)                                509               9                               1096   0.3209418                        

> ranger_3$confusion.matrix
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1128              46                                647   0.3805601
2. Independents                                                 263              23                                245   0.9566855
3. Republicans (including leaners)                              572              31                               1011   0.3736059

> rf_caret_3$finalModel["confusion"]
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1268               0                                553   0.3036793
2. Independents                                                   304               0                                227   1.0000000
3. Republicans (including leaners)                                606               0                               1008   0.3754647

> rf_two_cat$confusion
                                     1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1775                                 46   0.0252608
3. Republicans (including leaners)                               1581                                 33   0.9795539

> ranger_2$confusion.matrix
                                   1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1154                                667   0.3662823
3. Republicans (including leaners)                              590                               1024   0.3655514

> rf_caret_2$finalModel["confusion"]
                                   1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1315                                  506   0.2778693
3. Republicans (including leaners)                              666                                  948   0.4126394

> rf_scramble$confusion
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1104               0                                717   0.3937397
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              501               0                               1112   0.3106014

> ranger_scram$confusion.matrix
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1159               0                                662   0.3635365
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              577               0                               1036   0.3577185

> rf_caret_scramble$finalModel["confusion"]
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1315               0                                506   0.2778693
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              666               0                                947   0.4128952
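
The trends listed earlier can be checked by hand from the three-category confusion matrices above. The following is a minimal base R sketch that copies the counts from those matrices and computes the overall accuracy and the number of observations predicted as Independent for each package. The short class labels and the helper `make_cm` are introduced here for illustration only and are not part of the original code.

```r
# Counts copied from the three-category confusion matrices above
# (rows = actual class, columns = predicted class; class.error column omitted).
classes <- c("Dem", "Ind", "Rep")  # short labels, introduced here for readability

# Hypothetical helper: build a labeled 3x3 confusion matrix from row-wise counts.
make_cm <- function(counts) {
  matrix(counts, nrow = 3, byrow = TRUE,
         dimnames = list(actual = classes, predicted = classes))
}

cm <- list(
  randomForest = make_cm(c(1121,  3,  697,  263,  7, 261,  509,  9, 1096)),
  ranger       = make_cm(c(1128, 46,  647,  263, 23, 245,  572, 31, 1011)),
  caret        = make_cm(c(1268,  0,  553,  304,  0, 227,  606,  0, 1008))
)

# Overall accuracy = correctly classified observations / all observations.
overall_accuracy <- sapply(cm, function(m) sum(diag(m)) / sum(m))

# Number of observations predicted as Independent (column total).
predicted_independent <- sapply(cm, function(m) sum(m[, "Ind"]))

print(round(overall_accuracy, 3))
print(predicted_independent)
```

On these counts the overall accuracies come out at roughly 0.56 (randomForest), 0.55 (ranger) and 0.57 (caret), and ranger assigns 100 observations to Independents against 19 for randomForest and 0 for caret.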

1 answer:

Answer 0 (score: 0)

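First, the random forest algorithm is... random, so some variation is to be expected by default. Second, and more importantly, the algorithms are different, that is, they use different steps, so you will get different results.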
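
The first point is easy to check directly. The sketch below uses the built-in iris data as a stand-in (an assumption, since three_cat.Rda is not available here) and a hypothetical helper `fit_with_seed`. It shows that fixing the seed makes a randomForest fit repeatable, while a different seed may shift the out-of-bag confusion matrix slightly.

```r
library(randomForest)

# iris is a stand-in data set; it is not the data from the question.
# fit_with_seed is a hypothetical helper introduced for this illustration.
fit_with_seed <- function(seed) {
  set.seed(seed)  # randomForest draws from R's random number generator
  randomForest(Species ~ ., data = iris, ntree = 501, mtry = 2)
}

fit_a <- fit_with_seed(1)
fit_b <- fit_with_seed(1)
fit_c <- fit_with_seed(2)

# Same seed: the out-of-bag confusion matrices should be identical.
same_seed_identical <- identical(fit_a$confusion, fit_b$confusion)
print(same_seed_identical)

# Different seed: the matrices may differ slightly.
print(fit_a$confusion)
print(fit_c$confusion)
```

If the differences between packages persist across many seeds, as the question reports, plain run-to-run randomness is unlikely to be the whole explanation. Note that ranger also accepts its own `seed` argument.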

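You should look at how they perform the splits (criterion: Gini, extra trees, etc.), whether the splits are randomized (extremely randomized trees), how they draw the bootstrap samples (with or without replacement) and in what proportion, mtry (how many variables are selected at each split), the maximum depth or the maximum number of cases in a node, and so on.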
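
As a rough illustration of that checklist, the sketch below sets the sampling, node-size and split-rule arguments explicitly in both randomForest and ranger instead of relying on each package's defaults. It uses the built-in iris data as a stand-in, and the values are illustrative rather than recommendations. The argument pairs shown are, as far as I know, the closest counterparts in the two packages, but their exact semantics may differ, so treat this as a starting point for comparison, not as a guarantee of identical forests.

```r
library(randomForest)
library(ranger)

# iris is a stand-in data set; the settings below are illustrative assumptions.
n_trees   <- 501
n_try     <- 2
node_size <- 1

set.seed(42)
rf_fit <- randomForest(
  Species ~ ., data = iris,
  ntree    = n_trees,
  mtry     = n_try,
  replace  = TRUE,        # bootstrap with replacement
  sampsize = nrow(iris),  # size of each bootstrap sample
  nodesize = node_size    # minimum size of terminal nodes
)

ranger_fit <- ranger(
  Species ~ ., data = iris,
  num.trees       = n_trees,
  mtry            = n_try,
  replace         = TRUE,       # bootstrap with replacement
  sample.fraction = 1,          # fraction of rows drawn per tree
  min.node.size   = node_size,  # minimal node size
  splitrule       = "gini",     # "extratrees" would give extremely randomized splits
  seed            = 42
)

# Out-of-bag error rates of the two fits, side by side.
oob_error <- c(
  randomForest = unname(rf_fit$err.rate[n_trees, "OOB"]),
  ranger       = ranger_fit$prediction.error
)
print(oob_error)
```

Even with matched arguments the two forests will not be identical tree by tree, because the packages consume random numbers differently. The point of the comparison is whether systematic gaps, such as one package favoring a particular class, remain once the settings are aligned.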