Question

I am trying to understand the difference between the random forest implementations in the randomForest package and the caret package.

For example, this fits a randomForest with 2000 trees and mtry = 2, and shows the Gini importance (MeanDecreaseGini) of each predictor:

library(randomForest)
library(dplyr)  # provides %>%
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
# Mean decrease in Gini per predictor, largest first
data.frame(RF = sort(importance(rf1)[, "MeanDecreaseGini"], decreasing = TRUE)) %>%
  tibble::rownames_to_column("Predictor")
#      Predictor       RF
# 1  Petal.Width 45.57974
# 2 Petal.Length 41.61171
# 3 Sepal.Length  9.59369
# 4  Sepal.Width  2.47010

I tried to get the same information with caret, but I can't work out how to specify the number of trees or how to get the Gini importance:

library(caret)
rf2 <- train(Species ~ ., data = iris, method = "rf",
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2))
varImp(rf2) # not the Gini importance
#              Overall
# Petal.Length 100.000
# Petal.Width   99.307
# Sepal.Width    0.431
# Sepal.Length   0.000

Also, the confusion matrix for rf1 shows some errors while the one for rf2 shows none. Which parameter causes this difference?

# rf1 Confusion matrix:
#            setosa versicolor virginica class.error
# setosa         50          0         0        0.00
# versicolor      0         47         3        0.06
# virginica       0          4        46        0.08

table(predict(rf2, iris), iris$Species)
#             setosa versicolor virginica
#  setosa         50          0         0
#  versicolor      0         50         0
#  virginica       0          0        50

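This is quick and dirty. I know it is not the right way to test classifier performance, but I don't understand the difference in the results.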

Answer 1

This may help answer the question; see the second post:

caret: using random forest and include cross-validation

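randomForest draws bootstrap samples, that is, it samples with replacement. If you use method = "rf" in caret, you need to specify trControl in caret::train(). To rely on the same resampling in caret, namely the forest's own bootstrap samples and their out-of-bag error, set trControl = trainControl(method = "oob"). trControl takes the list of settings returned by trainControl(), which defines how train() behaves; its method can be "cv" for cross-validation, "repeatedcv" for repeated cross-validation, and so on. See the caret package documentation for details.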

You should then get the same results as with randomForest, provided you remember to set the seed correctly.
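The suggestion above can be sketched as follows. This is only an illustration under a few assumptions: the seed value and the object name rf_oob are arbitrary choices, and ntree = 2000 and mtry = 2 are copied from the question rather than prescribed by this answer.

```r
library(caret)         # train(), trainControl()
library(randomForest)  # backend used by method = "rf"

set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Use the forest's own out-of-bag estimate instead of caret's
# default resampling.
rf_oob <- train(Species ~ ., data = iris, method = "rf",
                ntree = 2000,                    # forwarded to randomForest()
                tuneGrid = data.frame(mtry = 2),
                trControl = trainControl(method = "oob"))

rf_oob$results               # OOB performance for mtry = 2
rf_oob$finalModel$confusion  # OOB confusion matrix of the fitted forest
```

Because the out-of-bag estimate comes from a single forest per mtry value, this also avoids refitting the model once per resample.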

Answer 2

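I was recently also looking for a way to get MeanDecreaseGini out of caret's implementation of randomForest. I realise this was posted a long time ago, so perhaps caret has been updated and my suggestion is no longer needed, but I struggled to find a solution, so hopefully someone finds this useful.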

To set the number of trees in caret, pass ntree = xx to train(), just as you would to randomForest. Then, to output MeanDecreaseGini in caret, specify type = 2 (1 = MeanDecreaseAccuracy [default], 2 = MeanDecreaseGini) and scale = FALSE. The full code with its results is below (across a few runs I see small fluctuations in the magnitude of the results, which is random chance, but the ranking of the variables is consistent):

library(randomForest)
library(caret)

## randomForest
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(Gini = sort(importance(rf1, type = 2)[, ], decreasing = TRUE))
#                   Gini
# Petal.Width  43.924705
# Petal.Length 43.293731
# Sepal.Length  9.717544
# Sepal.Width   2.320682

## caret
rf2 <- train(Species ~ .,
             data = iris,
             method = "rf",
             ntree = 2000,      # randomForest's argument is ntree; train() forwards it
             importance = TRUE, # same as randomForest
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "none")) # skip the default 25 bootstrap resamples
varImp(rf2, type = 2, scale = FALSE)
# rf variable importance
#
#              Overall
# Petal.Width   44.475
# Petal.Length  43.401
# Sepal.Length   9.140
# Sepal.Width    2.267

As for the confusion about the confusion matrix (confusing wording?), this seems to be a by-product of how you calculated the confusion matrices. When I use the predict function for both models, my accuracy rises to 100% compared with the other method:

rf1$confusion
#            setosa versicolor virginica class.error
# setosa         50          0         0        0.00
# versicolor      0         47         3        0.06
# virginica       0          3        47        0.06

table(predict(rf1, iris), iris$Species)
#             setosa versicolor virginica
#  setosa         50          0         0
#  versicolor      0         50         0
#  virginica       0          0        50

rf2$finalModel$confusion
#            setosa versicolor virginica class.error
# setosa         50          0         0        0.00
# versicolor      0         47         3        0.06
# virginica       0          5        45        0.10

table(predict(rf2, iris), iris$Species)
#             setosa versicolor virginica
#  setosa         50          0         0
#  versicolor      0         50         0
#  virginica       0          0        50

However, I'm not sure whether rf1$confusion and rf2$finalModel$confusion both represent the predictions of the last tree. Perhaps someone with a better understanding of this can chip in.
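On the open point about the confusion component: according to the randomForest documentation it is computed from out-of-bag (OOB) data, and predict() called without newdata returns those OOB predictions, whereas predict() on the training data lets every tree vote, including trees that were grown on the row being predicted. The sketch below illustrates that difference; the seed and the names rf, oob_tab and resub_tab are arbitrary choices made for this example.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, for reproducibility only
rf <- randomForest(Species ~ ., data = iris, ntree = 2000, mtry = 2)

# predict() without newdata returns the stored OOB predictions:
# each flower is classified only by trees that did not see it.
oob_pred <- predict(rf)
oob_tab  <- table(observed = iris$Species, predicted = oob_pred)

# Compare with the count columns of rf$confusion
# (rows = observed class, columns = predicted class).
all(oob_tab == rf$confusion[, levels(iris$Species)])

# predict() with the training data uses every tree, including those
# grown on the flower being predicted, so the result is optimistic.
resub_tab <- table(predicted = predict(rf, iris), observed = iris$Species)
resub_tab
```

Read this way, the errors in the confusion component and the perfect table from predict(model, iris) are answers to two different questions, not a difference between the two packages.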
