Question

I would really appreciate feedback on how to interpret my RF model and how to evaluate its results in general.

57658 samples
   27 predictor
    2 classes: 'stayed', 'left' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec        
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.

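After several adjustments to the functional form of my Y variable and to the way I split the data, I got the results below. My ROC improved slightly, but interestingly my Sens and Spec changed drastically compared with the initial model.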

35000 samples
   27 predictor
    2 classes: 'stayed', 'left' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec     
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.

This time I split the data randomly rather than by time, and tried several mtry values using the following code:

```{r Cross Validation Part 1}
library(caret) # provides createFolds()

set.seed(1992) # setting a seed for replication purposes

folds <- createFolds(train_data$left_welfare, k = 5) # Partition the data into 5 equal folds

# Grid of candidate tuning parameters
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20), splitrule = c("variance", "extratrees"), min.node.size = c(1, 5, 10))

sapply(folds, length) # number of observations in each fold
```

and got the following results:

Random Forest 

84172 samples
   14 predictor
    2 classes: 'stayed', 'left' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec     
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.

Answer 1

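It looks like your random forest has almost no predictive power for the second class, 'left'. The best scores all combine extremely high sensitivity with very low specificity, which basically means your classifier assigns almost everything to the 'stayed' class, which I assume is the majority class. Unfortunately this is quite bad, because it is not far from a naive classifier that says everything belongs to the first class.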
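
To make the comparison with a naive classifier concrete, here is a minimal sketch in base R. The labels are made up (90 'stayed', 10 'left') and are not the asker's data. The point is only that always predicting the majority class yields sensitivity 1 and specificity 0, which is almost exactly what the mtry = 2 rows in the first output report.

```r
# Hypothetical labels: 90 "stayed", 10 "left" (illustrative, not the asker's data)
truth <- factor(c(rep("stayed", 90), rep("left", 10)),
                levels = c("stayed", "left"))

# A naive classifier that always predicts the majority class
pred <- factor(rep("stayed", length(truth)), levels = levels(truth))

conf_mat <- table(predicted = pred, actual = truth)
print(conf_mat)

# caret's twoClassSummary treats the first factor level ("stayed") as the
# positive class, so Sens refers to "stayed" and Spec to "left"
sens <- conf_mat["stayed", "stayed"] / sum(conf_mat[, "stayed"])
spec <- conf_mat["left", "left"] / sum(conf_mat[, "left"])
acc  <- sum(diag(conf_mat)) / sum(conf_mat)

c(Sens = sens, Spec = spec, Accuracy = acc)
```

Accuracy looks respectable here only because of the class imbalance, which is why Sens and Spec are worth reading together.
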
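Also, I am not sure whether you only tried the values 2, 14 and 27 for mtry. If so, I strongly suggest trying the whole 3-25 range (the optimum is most likely somewhere in the middle).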
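
As a rough sketch of that suggestion, the grid below covers every mtry from 3 to 25 using the same `expand.grid` pattern as the question. The name `tune_grid`, the choice of split rules, and the commented-out `train()` call are assumptions for illustration, not the asker's actual setup. ranger documents "variance" as a regression split rule, so for a two-class outcome "gini" and "extratrees" are the safer choices; this may also be why the "variance" rows in the last output show NaN.

```r
# Candidate mtry values across the 3-25 range suggested above
# (27 predictors in the first two models, so mtry must stay <= 27)
tune_grid <- expand.grid(
  mtry = 3:25,
  splitrule = c("gini", "extratrees"),  # classification split rules in ranger
  min.node.size = 1,                    # held constant, as in the reported output
  stringsAsFactors = FALSE
)

nrow(tune_grid)  # 23 mtry values x 2 split rules = 46 candidate models
head(tune_grid)

# Assumed usage with caret (not run here; `train_data` and the outcome name
# are taken from the question and may differ from the real setup):
# fit <- caret::train(left_welfare ~ ., data = train_data, method = "ranger",
#                     metric = "ROC", tuneGrid = tune_grid,
#                     trControl = caret::trainControl(method = "cv", number = 5,
#                       classProbs = TRUE, summaryFunction = caret::twoClassSummary))
```

A full grid like this multiplies training time, so a coarser step (for example every second or third value) is a reasonable first pass.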

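Apart from that, since the performance looks rather poor (judging by the ROC), I suggest working more on feature engineering to extract more information. Otherwise, if you are fine with what you have, or you think no more information can be extracted, just adjust the probability threshold used for classification so that your sensitivity and specificity reflect what you need from each class (you may care more about misclassifying 'left' as 'stayed' than the other way round, or vice versa; I don't know your problem).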
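
The threshold adjustment can be sketched in base R as follows. Everything here is simulated: `p_left` stands in for the predicted probability of 'left' that a fitted model would return, and the 10% share of 'left' cases is an assumption. The sketch only shows the mechanics: lowering the cutoff for predicting 'left' trades sensitivity on 'stayed' for specificity on 'left'.

```r
set.seed(1)
n <- 1000

# Simulated outcome with roughly 10% "left" (assumed imbalance)
truth <- factor(ifelse(runif(n) < 0.1, "left", "stayed"),
                levels = c("stayed", "left"))

# Simulated P(left): weakly informative scores standing in for model output
p_left <- plogis(-2.5 + 1.0 * (truth == "left") + rnorm(n))

# Sens/Spec in caret's convention: "stayed" is the positive class
sens_spec <- function(threshold) {
  pred_left <- p_left >= threshold
  c(threshold   = threshold,
    sens_stayed = mean(!pred_left[truth == "stayed"]),
    spec_left   = mean(pred_left[truth == "left"]))
}

# Evaluate progressively lower cutoffs for predicting "left"
thresholds <- c(0.5, 0.3, 0.2, 0.1, 0.05)
results <- t(sapply(thresholds, sens_spec))
print(round(results, 3))
```

With a real model, the same table would be built from held-out predicted probabilities, and the cutoff chosen according to which kind of error is more costly.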

Hope this helps!

Interpreting random forest model results
