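I would like to find a criterion for selecting predictor variables based on "Mean Decrease Gini" and "Mean Decrease Accuracy". My original data set has 49 predictor variables, and my machine takes a long time (days) to extract the variables from raster objects. I am therefore looking for a statistical criterion for removing low-importance predictors from a random forest (RF) model. Here is a general example: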
#Packages
library(randomForest)
library(dplyr)
#Classical iris data set
data(iris)
Rep<-seq(1,length(iris[,1]))
all_iris<-cbind(Rep,iris)
#Training RF model (random 80% subset of the rows)
dg_o_cal<-all_iris %>% sample_n(150*0.8)
iris.rf <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Length
+ Petal.Width, data=dg_o_cal, importance=TRUE, proximity=TRUE)
#Plot Mean Decrease Gini and Mean Decrease Accuracy
varImpPlot(iris.rf)
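The values drawn by varImpPlot() can also be read as numbers, which is what any cutoff would be applied to. Below is a minimal sketch that refits a model on iris so that it runs on its own; the names train_demo, rf_demo and imp and the fixed seed are illustrative assumptions, not part of my real workflow.

```r
library(randomForest)

set.seed(42)  # assumption: fixed seed so the sketch is repeatable
data(iris)
train_demo <- iris[sample(nrow(iris), 120), ]  # 80% of the 150 rows
rf_demo <- randomForest(Species ~ ., data = train_demo, importance = TRUE)

# type = 1: mean decrease in accuracy (permutation importance)
# type = 2: mean decrease in Gini (node impurity)
mda <- importance(rf_demo, type = 1)
mdg <- importance(rf_demo, type = 2)

# Collect both measures in one table, one row per predictor
imp <- data.frame(
  variable = rownames(mda),
  MeanDecreaseAccuracy = mda[, 1],
  MeanDecreaseGini = mdg[, 1],
  row.names = NULL
)

# Sort from most to least important by mean decrease in accuracy
imp[order(-imp$MeanDecreaseAccuracy), ]
```

The two measures are on different scales, so a single numeric cutoff does not mean the same thing for both columns.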
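Now I would like to apply some statistical criterion, for example a cutoff of 6.5 on the "Mean Decrease Gini" and "Mean Decrease Accuracy" values. With that cutoff, Sepal.Width, which has low importance, would be removed from my data set. My final model is: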
#Final RF
iris.rf2 <- randomForest(Species ~ Sepal.Length + Petal.Length
+ Petal.Width, data=dg_o_cal, importance=TRUE, proximity=TRUE)
iris.rf2
iris.rf
Call:
randomForest(formula = Species ~ Sepal.Length + Petal.Length + Petal.Width, data = dg_o_cal, importance = TRUE, proximity = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 5%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         36          0         0  0.00000000
versicolor      0         41         3  0.06818182
virginica       0          3        37  0.07500000
Call:
randomForest(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = dg_o_cal, importance = TRUE, proximity = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 5%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         36          0         0  0.00000000
versicolor      0         41         3  0.06818182
virginica       0          3        37  0.07500000
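With this, I removed the low-importance variable and still got the same OOB estimate of error rate. So my question is: is there a statistical criterion for this kind of variable selection in my case?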
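To make the idea concrete, the manual step I did by eye could be sketched roughly as follows. This is only an illustration: it applies the cutoff to "Mean Decrease Gini" alone, the value 6.5 is the ad hoc number from my example rather than a justified threshold, and the names train_demo, rf_full, rf_reduced, keep and oob are made up for this sketch.

```r
library(randomForest)

set.seed(42)  # assumption: fixed seed so the sketch is repeatable
data(iris)
train_demo <- iris[sample(nrow(iris), 120), ]  # 80% of the 150 rows

# Full model with all four predictors
rf_full <- randomForest(Species ~ ., data = train_demo, importance = TRUE)

# Ad hoc cutoff on MeanDecreaseGini (example value, not a statistical criterion)
cutoff <- 6.5
gini <- importance(rf_full, type = 2)[, 1]  # named vector, one value per predictor
keep <- names(gini)[gini >= cutoff]

# Guard: if nothing passes the cutoff, keep the single most important predictor
if (length(keep) == 0) keep <- names(gini)[which.max(gini)]

# Refit using only the retained predictors
rf_reduced <- randomForest(reformulate(keep, response = "Species"),
                           data = train_demo, importance = TRUE)

# Compare the final OOB error rates of the two models
oob <- c(
  full    = unname(rf_full$err.rate[rf_full$ntree, "OOB"]),
  reduced = unname(rf_reduced$err.rate[rf_reduced$ntree, "OOB"])
)
keep
oob
```

Which predictors survive depends on the random split and on the chosen cutoff, which is exactly why I am asking for a more principled criterion.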