Question

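I am trying to compare the importance of predictor variables across models that predict different response variables. However, because the response variables differ, I am not sure whether this is valid. Can the importance values be compared directly?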

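If not, is there a way to do this? I think comparing importance ranks would work, but at the same time I feel it removes the variability in relative importance within each model, which changes the interpretation.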
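To make the rank-based idea concrete, here is a minimal sketch in base R. The importance values are made up purely for illustration (they are not results from any fitted model), and the names `imp_a` and `imp_b` are hypothetical. It shows both the rank comparison and the information the ranks discard.

```r
# Hypothetical unscaled importance values for two models (illustrative only,
# not output from real data); names are the four iris predictors.
imp_a <- c(Sepal.Length = 0.03, Sepal.Width = 0.01,
           Petal.Length = 0.30, Petal.Width = 0.28)
imp_b <- c(Sepal.Length = 0.002, Sepal.Width = 0.004,
           Petal.Length = 0.001, Petal.Width = 0.003)

# Rank within each model: 1 = most important predictor.
rank_a <- rank(-imp_a)
rank_b <- rank(-imp_b)

# Spearman correlation compares the two orderings:
# +1 means identical ordering, -1 means fully reversed.
rho <- cor(imp_a, imp_b, method = "spearman")

# What the ranks hide: the ratio of largest to smallest importance
# within each model can be very different.
spread_a <- max(imp_a) / min(imp_a)
spread_b <- max(imp_b) / min(imp_b)

print(rbind(rank_a, rank_b))
print(c(rho = rho, spread_a = spread_a, spread_b = spread_b))
```

With the real models, the vectors would presumably come from `importance(m1, type = 1, scale = FALSE)[, 1]` and the equivalent call for `m2`. The spread values illustrate the concern raised above: two models can have comparable rankings while differing greatly in how dominant the top predictors are.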

My data are very large, but for this question I can use the iris dataset to show what I mean. (It looks similar to my data, just massively scaled down.)

library(randomForest); library(ggplot2); library(tidyr)  # tidyr provides gather() and %>%

# Site must be a factor so that randomForest fits m2 as a classification model
test = data.frame(iris, 'Site' = factor(rep(c('A', 'B', 'C', 'D', 'E'), times = 30)))

# make models
m1 = randomForest(y = test[, 5], x = test[,1:4], importance = TRUE, proximity = TRUE, ntree = 500, norm.votes = FALSE)  
m2 = randomForest(y = test[, 6], x = test[,1:4], importance = TRUE, proximity = TRUE, ntree = 500, norm.votes = FALSE)  

# look at importances 
m1Imp = importance(m1, type = 1, scale = FALSE)
m2Imp = importance(m2, type = 1, scale = FALSE)

# plot comparison
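# name the columns explicitly: both importance matrices share the column name MeanDecreaseAccuracy
plotDF = data.frame('averageMeasure' = sample(1:100, 4)/100, m1 = m1Imp[, 1], m2 = m2Imp[, 1])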

plotDF %>%
  gather(key = "forest", value = "imp", -averageMeasure) %>%
  ggplot(aes(x = averageMeasure, y = imp, color = forest)) + geom_point() +
  scale_y_continuous("Predictor Importance")

How can I compare variable importance values between random forests that have different response variables?
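One possible approach, offered only as a sketch and not as an established answer, is to rescale each model's importances to shares of that model's total, so that values are expressed relative to the other predictors in the same forest. The helper `relative_importance` and the input values below are hypothetical. Whether such shares are meaningfully comparable across different responses remains a judgment call.

```r
# Hypothetical helper: convert raw importances to within-model shares.
relative_importance <- function(imp) {
  imp <- pmax(imp, 0)           # permutation importance can be negative; treat as zero
  total <- sum(imp)
  if (total == 0) return(imp)   # no predictor carries importance; avoid dividing by zero
  imp / total
}

# Made-up importance values for illustration only (not real model output).
imp_a <- c(Sepal.Length = 0.03, Sepal.Width = 0.01,
           Petal.Length = 0.30, Petal.Width = 0.28)
imp_b <- c(Sepal.Length = -0.001, Sepal.Width = 0.004,
           Petal.Length = 0.001, Petal.Width = 0.003)

rel_a <- relative_importance(imp_a)
rel_b <- relative_importance(imp_b)

# Each column now sums to 1, so the two models share a common scale.
print(round(cbind(rel_a, rel_b), 3))
```

Unlike ranks, the shares keep the relative gaps between predictors within each model, but they say nothing about how well each forest predicts its response, so a model with near-zero accuracy still produces shares that sum to 1.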

0 answers: