Question

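I am using the R package randomForest, and to understand variable importance we can look at varImpPlot, which shows the mean decrease in Gini (MeanDecreaseGini). I have studied random forests in detail and understand how the model works, but I cannot fully understand how the mean decrease in Gini is calculated, or why it depends on the population size (the number of observations).

After computing the Gini decrease, we can aggregate it into the mean decrease in Gini with the following formula (dividing by the number of trees):

I understand that with a larger population there will be more splits in each tree, but shouldn't those splits have a very small Gini decrease on average?

Here is example code that shows what I mean. As expected, the number of trees does not affect the mean decrease in Gini, but the population size has a huge effect, and the relationship appears to be linear: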

# install.packages("randomForest")  # run once if the package is not installed
library(randomForest)

set.seed(1)
# 50 observations: a two-class response and two numeric predictors
a <- as.factor(c(rep(1, 20), rep(0, 30)))
b <- c(rnorm(20, 5, 2), rnorm(30, 4, 1))
c <- c(rnorm(25, 0, 1), rnorm(25, 1, 2))
data <- data.frame(a = a, b = b, c = c)

rf <- randomForest(a ~ b + c, data = data, importance = TRUE, ntree = 300)
varImpPlot(rf)


# Same design with 500 observations (10x the first data set)
a2 <- as.factor(c(rep(1, 200), rep(0, 300)))
b2 <- c(rnorm(200, 5, 2), rnorm(300, 4, 1))
c2 <- c(rnorm(250, 0, 1), rnorm(250, 1, 2))
data2 <- data.frame(a2 = a2, b2 = b2, c2 = c2)

rf2 <- randomForest(a2 ~ b2 + c2, data = data2, importance = TRUE, ntree = 300)
varImpPlot(rf2)


# Same design with 5000 observations (100x the first data set)
a3 <- as.factor(c(rep(1, 2000), rep(0, 3000)))
b3 <- c(rnorm(2000, 5, 2), rnorm(3000, 4, 1))
c3 <- c(rnorm(2500, 0, 1), rnorm(2500, 1, 2))
data3 <- data.frame(a3 = a3, b3 = b3, c3 = c3)

rf3 <- randomForest(a3 ~ b3 + c3, data = data3, importance = TRUE, ntree = 300)
varImpPlot(rf3)

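In the following plots, the x-axis grows by roughly a factor of 10 with each tenfold increase in population:

My guess is that each split is weighted by the number of people it involves. That is, a split at the first node involving 1000 people would carry more weight than a split further down the tree involving, say, 10 people. However, I could not find this anywhere in the literature, where all the calculations seem to be expressed in fractions of the population rather than absolute numbers.

What am I missing?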

Answer 1

Your guess is correct.

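You have written down the definition of the Gini impurity decrease for a single split. Trees in a random forest are usually split many times. Nodes higher up in the tree contain more samples and, intuitively, are more "impure". So the formula for the mean decrease in Gini takes node size into account.

So instead of

Delta i(tau) = i(tau) - (n_l/n) i(tau_l) - (n_r/n) i(tau_r)

the impurity decrease is calculated as

Delta i(tau) = n i(tau) - n_l i(tau_l) - n_r i(tau_r)

That is, the impurities are weighted by raw counts rather than by proportions.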
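To make the difference between the two weightings concrete, here is a minimal base R sketch. The helper names `gini` and `split_decrease` and the class counts are made up for illustration; they are not part of randomForest. The same hypothetical split is evaluated at three sample sizes while the class proportions are held fixed.

```r
# Gini impurity of a node, given its vector of class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Impurity decrease of one split, weighted in the two ways discussed above.
# parent, left, right are class-count vectors with left + right == parent.
split_decrease <- function(parent, left, right) {
  n   <- sum(parent)
  n_l <- sum(left)
  n_r <- sum(right)
  c(
    by_proportion = gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right),
    by_count      = n * gini(parent) - n_l * gini(left) - n_r * gini(right)
  )
}

# Hypothetical split: 20 vs 30 cases at the root, 15 vs 5 sent to the left child.
# Multiplying every count by k keeps all proportions unchanged.
for (k in c(1, 10, 100)) {
  parent <- k * c(20, 30)
  left   <- k * c(15, 5)
  right  <- parent - left
  print(split_decrease(parent, left, right))
}
```

In this sketch the proportion-weighted decrease is identical at every sample size, while the count-weighted decrease is exactly n times larger, so it grows tenfold with each tenfold increase in the data. That matches the linear growth seen in the plots from the question.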

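The algorithm keeps splitting each tree until it reaches the maximum possible size (unless you specify the nodesize or maxnodes arguments). A feature can therefore be chosen as the split criterion several times. Its overall importance in a tree is the sum of the Deltas at those splits. That is the importance calculation for a single tree. Finally, the importances are averaged over all trees in the forest.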
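Written out as a formula, the aggregation described above might look as follows. The notation is an assumption introduced here for illustration: T is the number of trees, v(tau) is the feature used to split node tau, and Imp_t(X_j) is the importance of feature X_j in tree t. Delta i(tau) is the count-weighted decrease n i(tau) - n_l i(tau_l) - n_r i(tau_r).

```latex
\mathrm{Imp}_t(X_j) = \sum_{\tau \in t,\; v(\tau) = X_j} \Delta i(\tau),
\qquad
\mathrm{MeanDecreaseGini}(X_j) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Imp}_t(X_j)
```

Dividing by T explains why the number of trees does not change the scale of the result, while the factor n inside each Delta i(tau) explains why the sample size does.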

Let's demonstrate this with a very contrived example.

library("randomForest")
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
set.seed(1)

n <- 1000
# There are three classes in equal proportions
a <- rep(c(-10,0,10), each = n)
# One feature is useless
b <- rnorm(3*n)
# The other feature is highly predictive but we need at least two splits
c <- rnorm(3*n, a)
data <- data.frame(a = as.factor(a), b = b, c = c)

# First let's do just one split, i.e., ask for just two terminal nodes

# Expected MeanDecreaseGini:
# With one split the best we can do is separate one class from the other two
3000*(2/3) - 1000*0 - 2000*(1/2)
#> [1] 1000

# Actual MeanDecreaseGini
rf3 <- randomForest(data = data, a ~ b + c, importance = TRUE,
                    ntree = 1000, mtry = 2, maxnodes = 2)
rf3$importance[, "MeanDecreaseGini"]
#>        b        c 
#>    0.000 1008.754


# Next let's do two splits; this is enough to separate classes perfectly

# Expected MeanDecreaseGini:
3000*(2/3) - 1000*0 - 2000*(1/2)  +   2000*(1/2) - 1000*0 - 1000*0
#> [1] 2000

# Actual MeanDecreaseGini
rf3 <- randomForest(data = data, a ~ b + c, importance = TRUE,
                    ntree = 1000, mtry = 2, maxnodes = 3)
rf3$importance[, "MeanDecreaseGini"]
#>        b        c 
#>    0.000 1999.333

Created on 2019-03-08 by the reprex package (v0.2.1)

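PS: It is good to know how importance is calculated with the Gini criterion. But please read this article for an explanation of why you should use permutation importance instead: https://explained.ai/rf-importance/index.html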
