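I don't understand the difference between the importance() function (randomForest package) and the importance values stored in the Random Forest model object.
I fitted a simple RF classification model and tried to get the variable importance with the following code: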
rf_model$importance
             0           1 MeanDecreaseAccuracy MeanDecreaseGini
X1 0.096886458 0.032546101          0.055488009      2472.172207
X2 0.030985037 0.025615202          0.027530078      1338.378297
X3 0.124302743 0.012551971          0.052402188      3091.891586
importance(rf_model)
             0           1 MeanDecreaseAccuracy MeanDecreaseGini
X1 159.9149603 175.6265625           242.424683      2472.172207
X2 104.8273654 97.09338154          129.5084398      1338.378297
X3 157.0207876 86.93847182          216.6374153      3091.891586
Why do the first three columns differ between the two outputs, while MeanDecreaseGini is identical?
Answer 0 (score: 2)
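By default, importance(rf_model) is called with scale = TRUE, which divides the permutation-based measures (the per-class columns and MeanDecreaseAccuracy) by their "standard errors". MeanDecreaseGini is not permutation-based, so it is never scaled and is identical in both outputs. rf_model$importance holds the raw, unscaled values. Consider this example: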
library(randomForest)
set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000,
keep.forest=FALSE, importance=TRUE)
mtcars.rf$importance
#output
%IncMSE IncNodePurity
cyl 7.3939431 162.38777
disp 10.0468306 257.46627
hp 7.6801388 200.22729
drat 1.0921653 65.96165
wt 9.7998328 250.94940
qsec 0.6066792 38.52055
vs 0.7048540 24.75183
am 0.6201962 17.27180
gear 0.4110634 16.33811
carb 1.0549523 27.47096
The same values are returned by importance() with scale = FALSE:
importance(mtcars.rf, scale = FALSE)
%IncMSE IncNodePurity
cyl 7.3939431 162.38777
disp 10.0468306 257.46627
hp 7.6801388 200.22729
drat 1.0921653 65.96165
wt 9.7998328 250.94940
qsec 0.6066792 38.52055
vs 0.7048540 24.75183
am 0.6201962 17.27180
gear 0.4110634 16.33811
carb 1.0549523 27.47096
With the default (scale = TRUE):
importance(mtcars.rf)
%IncMSE IncNodePurity
cyl 15.767986 162.38777
disp 19.885128 257.46627
hp 18.177916 200.22729
drat 7.002942 65.96165
wt 18.479239 250.94940
qsec 5.022593 38.52055
vs 4.427525 24.75183
am 6.435329 17.27180
gear 3.968845 16.33811
carb 8.207903 27.47096
Finally, dividing the unscaled %IncMSE by its standard deviation reproduces the scaled values:
importance(mtcars.rf, scale = FALSE)[,1]/mtcars.rf$importanceSD
      cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
15.767986 19.885128 18.177916  7.002942 18.479239  5.022593  4.427525  6.435329  3.968845  8.207903
This matches importance(mtcars.rf)[,1]:
all.equal(importance(mtcars.rf, scale = FALSE)[,1]/mtcars.rf$importanceSD,
importance(mtcars.rf)[,1])
#output
TRUE
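The example above is a regression forest, where importanceSD is a plain vector. The question concerns a classification forest, where importanceSD is a matrix covering the per-class columns and MeanDecreaseAccuracy. The following is a minimal sketch of the same check for that case. The iris data, the object names (iris.rf, raw, imp_sd, manual) and ntree = 500 are assumptions chosen for illustration and are not taken from the question.

```r
library(randomForest)

set.seed(4543)
# Hypothetical classification model; importance = TRUE is required to get
# the permutation-based measures and their standard deviations.
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                        importance = TRUE)

raw    <- iris.rf$importance    # unscaled values, as in rf_model$importance
imp_sd <- iris.rf$importanceSD  # one column per class + MeanDecreaseAccuracy
k      <- ncol(raw)             # last column is MeanDecreaseGini

# Divide every permutation-based column by its SD and leave the Gini column
# untouched. Near-zero SDs are replaced by 1 to avoid division by zero.
manual <- raw
manual[, -k] <- raw[, -k] / ifelse(imp_sd < .Machine$double.eps, 1, imp_sd)

# Compare the manual scaling with the default importance() output.
all.equal(manual, importance(iris.rf))

# The Gini column is the same with or without scaling.
identical(raw[, k], importance(iris.rf)[, k])
```

If both comparisons hold, the only difference between rf_model$importance and importance(rf_model) is the division by importanceSD, applied to every column except MeanDecreaseGini.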