Question

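Hello everyone, and thank you very much for your help.

I have fitted a random forest model for classification. Now I want to determine the best threshold to optimize specificity and sensitivity.

I am confused because, as stated in the title, the "coords" function of the "pROC" package returns values that differ from those of the "confusionMatrix" function of the "caret" package.

Here is the code: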

# package import

library(caret)
library(pROC)

# data import

data <- read.csv2("denonciation.csv", check.names = F)

# data partition

validation_index <- createDataPartition(data$Denonc, p=0.80,list=FALSE)
validation <- data[-validation_index,]
entrainement <- data[validation_index,]

# handling class imbalance

set.seed (7)
up_entrainement <- upSample(x=entrainement[,-ncol(entrainement)],y=entrainement$Denonc)

# Cross validation setting

control <- trainControl(method ="cv", number=10, classProbs = TRUE)

# Model training

fit.rf_up <-train(Denonc~EMOTION+Agreabilite_classe+Conscienciosite_classe, data = up_entrainement, method="rf", trControl = control)

# Best threshold determination

roc <- roc(up_entrainement$Denonc, predict(fit.rf_up, up_entrainement, type = "prob")[,2])
coords(roc, x="best", input = "threshold", best.method = "closest.topleft")

### The best threshold seems to be .36 with a specificity of .79 and a sensitivity of .73 ###

# Confusion matrix with the best threshold returned by "coords"

probsTest <- predict(fit.rf_up, validation, type = "prob")
threshold <- 0.36
predictions <- factor(ifelse(probsTest[, "denoncant"] > threshold, "denoncant", "non_denoncant"))
confusionMatrix(predictions, validation$Denonc)

The values here are different:

Confusion Matrix and Statistics

                Reference
Prediction      denoncant non_denoncant
  denoncant           433          1380
  non_denoncant       386          1671

           Accuracy : 0.5437          
             95% CI : (0.5278, 0.5595)
No Information Rate : 0.7884          
P-Value [Acc > NIR] : 1               

              Kappa : 0.0529          
 Mcnemar's Test P-Value : <2e-16          

        Sensitivity : 0.5287          
        Specificity : 0.5477          
     Pos Pred Value : 0.2388          
     Neg Pred Value : 0.8123          
         Prevalence : 0.2116          
     Detection Rate : 0.1119          
   Detection Prevalence : 0.4685          
    Balanced Accuracy : 0.5382          

   'Positive' Class : denoncant

Please, could you tell me why the "coords" function of the "pROC" package returns wrong values?

Many thanks,

Baboon

Answer 1

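I see 2 possible problems here:

1. When training the model, you balanced the two classes by upsampling the less frequent one, so the best threshold derived from the model was also calibrated on that same upsampled dataset. As far as I can tell, the validation dataset was not upsampled.
2. The two results give model metrics on different sets (training and validation). For a RandomForest model they should be fairly close, given all the averaging that goes on under the hood, but that does not mean the results will be exactly the same. A RandomForest model is unlikely to overfit the data, but that may not always hold if the data is a mixture of several distinct populations with different feature-vector distributions and/or different feature-response relationships. Even if you sample the data truly at random, those populations may not be evenly distributed between the training and validation sets (that is, the distribution may be the same on average, but not for a particular training-validation split).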
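The second point can be illustrated without either package. The following is a minimal base-R sketch, not your actual data: the probabilities are made up, and `sens_spec` is a hypothetical helper written only for this example. It shows that one fixed threshold can give quite different sensitivity and specificity on two different samples.

```r
# Hypothetical helper (not part of pROC or caret): sensitivity and
# specificity of the rule "predict positive when prob > threshold".
sens_spec <- function(prob, truth, threshold, positive) {
  predicted_pos <- prob > threshold
  actual_pos <- truth == positive
  c(sensitivity = sum(predicted_pos & actual_pos) / sum(actual_pos),
    specificity = sum(!predicted_pos & !actual_pos) / sum(!actual_pos))
}

# Made-up probabilities of the positive class "denoncant" on two samples.
truth <- rep(c("denoncant", "non_denoncant"), each = 4)
train_prob <- c(0.90, 0.80, 0.70, 0.20, 0.30, 0.10, 0.25, 0.40)
valid_prob <- c(0.50, 0.30, 0.20, 0.10, 0.60, 0.45, 0.30, 0.20)

# Same threshold, two samples, two different answers.
train_stats <- sens_spec(train_prob, truth, 0.36, "denoncant")
valid_stats <- sens_spec(valid_prob, truth, 0.36, "denoncant")
print(rbind(training = train_stats, validation = valid_stats))
```

The threshold is identical in both calls; only the sample changes, which is the situation when "coords" is run on the training data and "confusionMatrix" on the validation data.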

I think the first problem is the one at play here, but unfortunately I cannot test your code because it depends on the file denonciation.csv.
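Without the file, one way to check is to make both functions look at the same data. The sketch below uses simulated probabilities as stand-ins for `validation$Denonc` and `probsTest[, "denoncant"]`; the names `roc_valid`, `from_coords` and `by_hand` are mine, and it assumes a pROC version that supports the `transpose` argument of `coords`. It asks `coords` for the sensitivity and specificity at the fixed threshold 0.36 on that one sample and compares them with the values computed by hand from the same cut.

```r
library(pROC)

set.seed(7)
# Simulated stand-ins for validation$Denonc and probsTest[, "denoncant"].
n <- 200
truth <- factor(rep(c("denoncant", "non_denoncant"), each = n / 2),
                levels = c("denoncant", "non_denoncant"))
prob_denoncant <- ifelse(truth == "denoncant",
                         rbeta(n, 3, 2), rbeta(n, 2, 3))

# levels = c(control, case): "denoncant" is declared as the case so that
# it matches the 'Positive' class reported by confusionMatrix;
# direction = "<" states that cases tend to have the higher probability.
roc_valid <- roc(response = truth, predictor = prob_denoncant,
                 levels = c("non_denoncant", "denoncant"), direction = "<")

threshold <- 0.36
from_coords <- coords(roc_valid, x = threshold, input = "threshold",
                      ret = c("sensitivity", "specificity"),
                      transpose = FALSE)

# The same quantities by hand, using the same cut as the ifelse() call
# in the question.
predicted_pos <- prob_denoncant > threshold
by_hand <- c(sensitivity = mean(predicted_pos[truth == "denoncant"]),
             specificity = mean(!predicted_pos[truth == "non_denoncant"]))

print(from_coords)
print(by_hand)
```

If the two agree on the same sample, as they should here, the gap in the question would come from evaluating on different data (upsampled training set versus validation set) rather than from "coords" itself. Stating `levels` and the probability column explicitly also keeps the positive class the same in both computations.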

The "coords" function of the "pROC" package returns different sensitivity and specificity values than the "confusionMatrix" function of the "caret" package
