I admit this is a difficult question that perhaps only the package authors can answer, but I am consistently getting different results from three different random forest implementations in R.
The three implementations in question are the randomForest package, the "rf" method in caret, and the ranger package. The code is included below.
The data in question are just one example; I have seen similar behavior with other, similar data specifications.
The left-hand-side variable is party identification (Dem., Rep., Indep.). The right-hand-side predictors are demographics. In an attempt to figure out what was going on with the bizarre results in the randomForest package, I tried implementing the same model in the other two approaches, and found that they do not reproduce that particular anomaly. This is especially strange because, as I understand it, the rf method in caret is simply an indirect call to the randomForest package.
The three specifications I ran in each implementation are: (1) a three-category classification; (2) the same model with the Independent category dropped; and (3) the same as (2), but with a single observation's label scrambled to "Independent" so that the model keeps three categories — it should therefore produce results very similar to (2). As far as I can tell, in no case should there be any over- or under-sampling that would explain the results.
Below, I include both the code and the confusion matrices that show the discrepancies in question. Every re-run of the code produces similar patterns in the confusion matrices; this is not a case of any single run happening to produce odd results.
Does anyone know why these packages generally produce slightly different results (and, in the case of the linked randomForest issue, very different ones) — or, better yet, why they differ in this particular way? For example, is there some kind of sample weighting/stratification happening inside these packages' wrappers that I should be aware of?
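For reference, the sampling-related defaults of the two underlying packages can be compared directly from their argument lists; this is just a way to inspect the knobs, assuming both packages are installed:

```r
library(randomForest)
library(ranger)

# Both packages draw bootstrap samples with replacement by default,
# but the relevant arguments are named differently:
args(randomForest:::randomForest.default)  # replace, sampsize, classwt, cutoff, strata
args(ranger)                               # replace, sample.fraction, case.weights
```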
Code:
num_trees <- 1001
var_split <- 3

load("three_cat.Rda")
# Note: type= and confusion= are not documented randomForest() arguments;
# they appear to be silently ignored (the task type is inferred from the
# factor outcome, and the confusion matrix is always computed).
rf_three_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                             data = three_cat,
                             ntree = num_trees,
                             mtry = var_split,
                             type = "classification",
                             importance = TRUE, confusion = TRUE)
two_cat <- subset(three_cat, party_id_3_cat != "2. Independents")
two_cat$party_id_3_cat <- droplevels(two_cat$party_id_3_cat)
rf_two_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                           data = two_cat,
                           ntree = num_trees,
                           mtry = var_split,
                           type = "classification",
                           importance = TRUE, confusion = TRUE)
scramble_independent <- subset(three_cat, party_id_3_cat != "2. Independents")
scramble_independent[1, 19] <- "2. Independents"
scramble_independent <- data.frame(lapply(scramble_independent, as.factor),
                                   stringsAsFactors = TRUE)
rf_scramble <- randomForest(party_id_3_cat ~ {RHS Vars},
                            data = scramble_independent,
                            ntree = num_trees,
                            mtry = var_split,
                            type = "classification",
                            importance = TRUE, confusion = TRUE)
ranger_2 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = two_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_3 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = three_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_scram <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                       data = scramble_independent,
                       num.trees = num_trees, mtry = var_split)
rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = c(3))
rf_caret_3 <- train(party_id_3_cat ~ {RHS Vars},
                    data = three_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2 <- train(party_id_3_cat ~ {RHS Vars},
                    data = two_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat ~ {RHS Vars},
                           data = scramble_independent,
                           method = "rf", ntree = num_trees,
                           type = "classification",
                           importance = TRUE, confusion = TRUE,
                           trControl = rfControl, tuneGrid = rfGrid)
rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]
rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]
rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]
Results (formatting lightly edited for easier comparison):
> rf_three_cat$confusion
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1121 3 697 0.3844042
2. Independents 263 7 261 0.9868173
3. Republicans (including leaners) 509 9 1096 0.3209418
> ranger_3$confusion.matrix
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1128 46 647 0.3805601
2. Independents 263 23 245 0.9566855
3. Republicans (including leaners) 572 31 1011 0.3736059
> rf_caret_3$finalModel["confusion"]
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1268 0 553 0.3036793
2. Independents 304 0 227 1.0000000
3. Republicans (including leaners) 606 0 1008 0.3754647
> rf_two_cat$confusion
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1775 46 0.0252608
3. Republicans (including leaners) 1581 33 0.9795539
> ranger_2$confusion.matrix
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1154 667 0.3662823
3. Republicans (including leaners) 590 1024 0.3655514
> rf_caret_2$finalModel["confusion"]
1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1315 506 0.2778693
3. Republicans (including leaners) 666 948 0.4126394
> rf_scramble$confusion
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1104 0 717 0.3937397
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 501 0 1112 0.3106014
> ranger_scram$confusion.matrix
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1159 0 662 0.3635365
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 577 0 1036 0.3577185
> rf_caret_scramble$finalModel["confusion"]
1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners) 1315 0 506 0.2778693
2. Independents 0 0 1 1.0000000
3. Republicans (including leaners) 666 0 947 0.4128952
Answer (score: 0):
First, the random forest algorithm is... random, so some variation is expected by default. Second, and more importantly, the algorithms themselves differ — that is, they use different steps — so you will get different results.
You should look at how they perform splits (criterion: Gini, extratrees, etc.) and whether the splits themselves are randomized (extremely randomized trees); how they draw bootstrap samples (with or without replacement, and what fraction of the data); mtry, i.e., how many variables are considered at each split; the maximum depth or maximum number of cases per node; and so on.
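As a sketch of what aligning those knobs looks like across randomForest and ranger (the argument names are the packages' real ones; the outcome `y` and data frame `df` are hypothetical stand-ins for your data):

```r
library(randomForest)
library(ranger)

set.seed(42)
rf_fit <- randomForest(y ~ ., data = df,
                       ntree   = 1001,
                       mtry    = 3,
                       replace = TRUE)       # full-size bootstrap with replacement (default)

rg_fit <- ranger(y ~ ., data = df,
                 num.trees       = 1001,
                 mtry            = 3,
                 replace         = TRUE,
                 sample.fraction = 1,         # match randomForest's full-size bootstrap
                 splitrule       = "gini",    # ranger's default split criterion for classification
                 seed            = 42)
```

Note that even with matched settings the two packages use different internal random number streams, so the fits will agree only in distribution, not tree by tree.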