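I am using the randomForest package in R. To speed up the classification step, I would like to grow the forest in parallel. To do this, I used the 'foreach' package in a way similar to what the 'foreach' vignette describes: divide the total number of trees by the number of cores you want to use, then merge the resulting forests with the 'combine' function from the 'randomForest' package: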
require(randomForest)
require(foreach)
require(doParallel)
registerDoParallel(cores=CPUS)
rf <- foreach::foreach(ntree=rep(ceiling(NTREE/CPUS), CPUS), .combine=randomForest::combine, .packages='randomForest') %dopar% {
randomForest::randomForest(x=t(Y), y=A, ntree=ntree, importance=TRUE, ...)
}
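I compared the result of the "parallel" forest with a forest grown on a single core. Predictive performance on the test set appears similar, but the 'importance' values are considerably lower, which affects the subsequent variable selection step.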
imp <- importance(rf,type=1)
I would like to know why this happens, and whether it is correct or a mistake. Thanks a lot!
Answer 0 (score: 1)
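randomForest::combine does not support recomputing variable importance. In the randomForest package, importance is only computed just before the randomForest::randomForest function terminates. There are two options:
Write your own variable importance function that takes the combined forest and the training set as input. This is roughly 50 lines of code.
Use 'lapply'-style parallel computation, where each randomForest object is one element of the output list. Then aggregate the variable importance across all forests and simply take the mean. Combine the forests outside the foreach loop instead, using do.call(combine, rf.list). This method only approximates the overall variable importance, but the approximation is quite good.
Code example that also works on Windows: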
library(randomForest)
library(doParallel)
CPUS=6; NTREE=5000
cl = makeCluster(CPUS)
registerDoParallel(cl)
data(iris)
#ceiling() keeps ntree an integer; the total can slightly exceed NTREE
rf.list = foreach(ntree = rep(ceiling(NTREE/CPUS),CPUS),
                  .combine=c,
                  .packages="randomForest") %dopar% {
  list(randomForest(Species~.,data=iris,importance=TRUE, ntree=ntree))
}
stopCluster(cl)
big.rf = do.call(combine,rf.list)
big.rf$importance = rf.list[[1]]$importance
for(i in 2:CPUS) big.rf$importance = big.rf$importance + rf.list[[i]]$importance
big.rf$importance = big.rf$importance / CPUS
varImpPlot(big.rf)
#check the number of trees in the combined forest, big.rf, and in one sub-forest
print(big.rf) #about NTREE (5000) trees in total
rf.list[[1]]$ntree #about NTREE/CPUS trees
#training single forest
rf.single = randomForest(Species~.,data=iris,ntree=5000,importance=TRUE)
varImpPlot(big.rf)
varImpPlot(rf.single)
#print unscaled variable importance, no large deviations
print(big.rf$importance)
# setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.033184860 0.023506673 0.04043017 0.03241500 9.679552
# Sepal.Width 0.008247786 0.002135783 0.00817186 0.00613059 2.358298
# Petal.Length 0.335508637 0.304525644 0.29786704 0.30933142 43.160074
# Petal.Width 0.330610910 0.307016328 0.27129746 0.30023245 44.043737
print(rf.single$importance)
# setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.031771614 0.0236603417 0.03782824 0.031049531 9.516198
# Sepal.Width 0.008436457 0.0009236979 0.00880401 0.006048261 2.327478
# Petal.Length 0.341879367 0.3090482654 0.29766905 0.312507316 43.786481
# Petal.Width 0.322015885 0.3045458852 0.26885097 0.296227150 43.623370
#but when plotting with varImpPlot, scale=TRUE by default
#either simply turn off scaling to get comparable results
varImpPlot(big.rf,scale=FALSE)
varImpPlot(rf.single,scale=FALSE)
#... or correct scaling to the number of trees
big.rf$importanceSD = CPUS^-.5 * big.rf$importanceSD
#and now there are no large differences in the scaled variable importance either
varImpPlot(big.rf,scale=TRUE)
varImpPlot(rf.single,scale=TRUE)
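The averaging loop in the example above can also be folded into a small helper. The sketch below is only an illustration in base R: `average_importance` is a hypothetical name, not part of randomForest, and it assumes every importance matrix has the same rows and columns. It weights each matrix by the number of trees in its forest, which reduces to the plain mean when all sub-forests are the same size.

```r
# Hypothetical helper (not part of randomForest): tree-weighted mean of
# several importance matrices that share the same dimensions.
average_importance <- function(imp_list, ntrees) {
  stopifnot(length(imp_list) == length(ntrees), length(imp_list) > 0)
  # Scale each matrix by the number of trees behind it ...
  weighted <- Map(function(m, n) m * n, imp_list, ntrees)
  # ... then sum the matrices and divide by the total number of trees.
  Reduce(`+`, weighted) / sum(ntrees)
}

# Assumed usage with the rf.list and big.rf objects from the example above:
# big.rf$importance <- average_importance(
#   lapply(rf.list, function(f) f$importance),
#   sapply(rf.list, function(f) f$ntree)
# )
```

Like the loop it replaces, this is still an approximation of the importance a single large forest would report, not a recomputation on the combined forest.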