I would really appreciate an explanation of my RF model and feedback on the overall evaluation results.
57658 samples
27 predictor
2 classes: 'stayed', 'left'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:
  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
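After several adjustments to the functional form of my Y variable and to the way I split the data, I got the following results. My ROC improved slightly, but interestingly, my Sens and Spec changed drastically compared with the initial model.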
35000 samples
27 predictor
2 classes: 'stayed', 'left'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:
  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
This time, I split the data randomly rather than by time, and tried several mtry values using the following code:
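```{r Cross Validation Part 1}
library(caret) # provides createFolds()
set.seed(1992) # setting a seed for replication purposes
folds <- createFolds(train_data$left_welfare, k = 5) # partition the data into 5 roughly equal folds
# Tuning grid. Note: in ranger, "variance" is a regression split rule; the classification rules are "gini", "extratrees" and "hellinger".
tune_mtry <- expand.grid(mtry = c(2,10,15,20), splitrule = c("variance", "extratrees"), min.node.size = c(1,5,10))
sapply(folds,length) # number of observations in each fold
```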
and got the following results:
Random Forest
84172 samples
14 predictor
2 classes: 'stayed', 'left'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:
  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000  NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000  NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000  NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000  NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000  NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000  NaN        NaN
  15    extratrees  0.5000000  NaN        NaN
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
Answer 0 (score: 1)
It looks like your random forest has almost no predictive power for the second class, 'left'.
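The best scores all have extremely high sensitivity and very low specificity, which basically means your classifier is just assigning everything to the 'stayed' class, which I imagine is the majority class. Unfortunately this is pretty bad, because it is not far from a naive classifier that says everything belongs to the first class.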
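To make that concrete, here is a minimal sketch on made-up labels (the 90/10 class split is an assumption for illustration, not taken from the question). It treats 'stayed' as the positive class, which is what caret's `twoClassSummary` does for the first factor level, and shows that a classifier which always predicts 'stayed' gets sensitivity 1 and specificity 0:

```r
# Hypothetical labels: 90% 'stayed', 10% 'left' (illustrative only)
truth <- factor(c(rep("stayed", 900), rep("left", 100)),
                levels = c("stayed", "left"))

# Naive classifier: always predicts the majority class
pred <- factor(rep("stayed", length(truth)), levels = levels(truth))

# Sensitivity = recall on the positive class ('stayed' here)
sens <- sum(pred == "stayed" & truth == "stayed") / sum(truth == "stayed")
# Specificity = recall on the negative class ('left' here)
spec <- sum(pred == "left" & truth == "left") / sum(truth == "left")

c(Sens = sens, Spec = spec)
```

Numbers close to this pattern, as in the `mtry = 2` rows of the first table, suggest the model is barely separating the classes at the default cutoff.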
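Also, I can't tell whether you only tried mtry values of 2, 14 and 27, but if so, I strongly suggest trying the whole 3-25 range (the best value is most likely somewhere in the middle).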
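A wider search could look roughly like the sketch below. The grid itself is plain base R. `fit_rf` is a hypothetical helper name, and it assumes the `train_data` data frame and `left_welfare` outcome from the question, the caret and ranger packages, and 5-fold cross-validation. Treat it as a starting point, not the definitive setup:

```r
# Candidate mtry values across the suggested 3-25 range
tune_grid <- expand.grid(
  mtry          = 3:25,
  splitrule     = "gini",  # a classification split rule in ranger
  min.node.size = 1,
  stringsAsFactors = FALSE
)

# Hypothetical helper: fits the forest with caret (must be installed).
# `train_data` and its outcome column `left_welfare` follow the question.
fit_rf <- function(train_data, grid = tune_grid) {
  ctrl <- caret::trainControl(
    method = "cv", number = 5,
    classProbs = TRUE,                        # class probabilities are needed for ROC
    summaryFunction = caret::twoClassSummary  # reports ROC, Sens, Spec
  )
  caret::train(
    left_welfare ~ ., data = train_data,
    method = "ranger", metric = "ROC",
    trControl = ctrl, tuneGrid = grid
  )
}

nrow(tune_grid)  # number of candidate settings
```

Calling `fit_rf(train_data)` would then evaluate every row of the grid. With this many candidates the run can be slow, so a coarser step (for example `seq(3, 25, by = 2)`) may be a reasonable first pass.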
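Apart from that, since the performance looks rather poor (judging by the ROC), I suggest you work more on feature engineering to extract more information. Otherwise, if you are happy with what you have, or you think nothing more can be extracted, just adjust the probability threshold for classification so that your sensitivity and specificity reflect what you need from each class (you may care more about misclassifying 'left' as 'stayed' than the other way round, or vice versa; I don't know your problem).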
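As a rough illustration of threshold tuning, the sketch below sweeps the cutoff on the predicted probability of 'left' and reports sensitivity and specificity at each value. The data are simulated (the class balance and the beta distributions are assumptions for illustration). With a real caret model, the probabilities would come from `predict(model, newdata, type = "prob")`:

```r
set.seed(1)  # simulated data, for illustration only

# Simulated truth and predicted P(left)
n <- 1000
truth <- factor(ifelse(runif(n) < 0.2, "left", "stayed"),
                levels = c("stayed", "left"))
p_left <- ifelse(truth == "left", rbeta(n, 3, 4), rbeta(n, 2, 5))

# Sensitivity and specificity at a given cutoff, with 'stayed' as the
# positive class (matching the caret output in the question)
sens_spec <- function(cutoff) {
  pred <- ifelse(p_left >= cutoff, "left", "stayed")
  c(cutoff = cutoff,
    Sens = mean(pred[truth == "stayed"] == "stayed"),
    Spec = mean(pred[truth == "left"] == "left"))
}

# One row per cutoff
thresholds <- t(sapply(seq(0.1, 0.9, by = 0.1), sens_spec))
thresholds
```

Lowering the cutoff labels more cases as 'left', which raises Spec at the cost of Sens. Which row to pick depends on how costly each kind of error is for your problem.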
Hope this helps!