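When comparing ROC curves of machine-learning models trained on normal and down-sampled data, the resulting sensitivity and specificity are usually very different, because the down-sampled model balances the classes and puts more weight on catching the minority class. Why, then, do the resulting ROC curves look so similar?
I think the question is best explained with a simple example, based on this question here.
First, get the Sonar data and manually sub-sample the "R" class to make the data imbalanced and illustrate my problem: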
library(caret)
library(dplyr)    # provides %>%, filter() and sample_n() used below
library(ggplot2)
library(mlbench)
library(plotROC)
data(Sonar)
set.seed(2019)
sonar_R <- Sonar %>% filter(Class == "R") %>% sample_n(., 20)
Sonar <- Sonar %>% filter(Class == "M") %>% rbind(sonar_R)
Now use caret to fit random forest models, one with normal sampling and one with down-sampling of the majority class:
# 5-fold CV repeated 5 times; keep class probabilities and held-out predictions
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                     summaryFunction = twoClassSummary, classProbs = TRUE,
                     savePredictions = TRUE)
# Same resampling scheme, with the majority class down-sampled
ctrl_down <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                          summaryFunction = twoClassSummary, classProbs = TRUE,
                          savePredictions = TRUE, sampling = "down")
rfFit <- train(Class ~ ., data = Sonar, method = "rf",
               preProc = c("center", "scale"), trControl = ctrl)
rfFit_down <- train(Class ~ ., data = Sonar, method = "rf",
                    preProc = c("center", "scale"), trControl = ctrl_down)
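To make the down-sampling step concrete, here is a minimal base-R sketch of the idea on made-up data. The helper `down_sample` and the toy data frame are hypothetical and only illustrate randomly discarding majority-class rows until the classes are the same size; this is not caret's own code, and caret applies its sampling inside each resample rather than once up front.

```r
# Toy imbalanced data (made-up values): many "M" rows, few "R" rows.
set.seed(1)
toy <- data.frame(
  x = rnorm(131),
  Class = factor(c(rep("M", 111), rep("R", 20)), levels = c("M", "R"))
)

# Hypothetical helper: keep a random subset of each class,
# sized to the smallest class.
down_sample <- function(df, class_col) {
  n_min <- min(table(df[[class_col]]))
  rows_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  keep <- unlist(lapply(rows_by_class, function(i) {
    i[sample.int(length(i), n_min)]   # sample row positions without replacement
  }))
  df[unname(keep), , drop = FALSE]
}

toy_down <- down_sample(toy, "Class")
table(toy_down$Class)
```

After this step both classes have the same number of rows, so a model fitted on `toy_down` no longer sees "M" as the dominant class.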
I can now define a function that returns the maximum ROC value together with the corresponding sensitivity and specificity:
max_accuracy <- function(model) {
  model_accuracy <- as.data.frame(model$results)
  model_accuracy <- model_accuracy %>%
    select(ROC, Sens, Spec) %>%
    arrange(desc(ROC))
  model_accuracy <- model_accuracy[1, ]
  return(model_accuracy)
}
max_accuracy(rfFit)
max_accuracy(rfFit_down)
which gives:
              ROC    Sens   Spec
Normal        0.910  1      0.16
Down Sampled  0.872  0.827  0.77
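For reference, a minimal sketch of what a single Sens/Spec pair represents. It assumes the values in the table come from hard class predictions at one probability cutoff (0.5 for a two-class model), with "M" as the positive class because it is the first factor level; that is my reading of twoClassSummary and is not verified here. The data and the helper `sens_spec` are made up for illustration.

```r
# Made-up held-out predictions: true class and predicted probability of "M".
obs <- factor(c("M", "M", "M", "M", "M", "M", "R", "R", "R", "R"),
              levels = c("M", "R"))
prob_M <- c(0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.75, 0.65, 0.40, 0.20)

# Hypothetical helper: sensitivity and specificity at one cutoff,
# treating "M" as the positive class.
sens_spec <- function(obs, prob_M, cutoff) {
  pred_M <- prob_M >= cutoff
  c(Sens = mean(pred_M[obs == "M"]),    # share of true M predicted as M
    Spec = mean(!pred_M[obs == "R"]))   # share of true R predicted as R
}

sens_spec(obs, prob_M, 0.5)    # one point on the ROC curve
sens_spec(obs, prob_M, 0.72)   # same scores, different cutoff, different point
```

The same set of scores yields a different Sens/Spec pair for every cutoff, whereas the ROC curve sweeps over all cutoffs at once.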
And plot the ROC curves:
# Keep only the held-out predictions for mtry == 2 (hard-coded tuning value)
selectedIndices <- rfFit$pred$mtry == 2
g <- ggplot(rfFit$pred[selectedIndices, ],
            aes(m = M, d = factor(obs, levels = c("R", "M")))) +
  geom_roc(n.cuts = 0, increasing = FALSE) +
  coord_equal() +
  style_roc(theme = theme_grey) +
  ggtitle("Normal")
g +
  annotate("text", x = 0.75, y = 0.25,
           label = paste("AUC =", round((calc_auc(g))$AUC, 4))) +
  scale_x_continuous("1 - Specificity") + scale_y_continuous("Sensitivity")
# Select rows from the down-sampled fit itself, not from rfFit
selectedIndices_down <- rfFit_down$pred$mtry == 2
g_down <- ggplot(rfFit_down$pred[selectedIndices_down, ],
                 aes(m = M, d = factor(obs, levels = c("R", "M")))) +
  geom_roc(n.cuts = 0, increasing = FALSE) +
  coord_equal() +
  style_roc(theme = theme_grey) +
  ggtitle("Down Sampled")
g_down +
  annotate("text", x = 0.75, y = 0.25,
           label = paste("AUC =", round((calc_auc(g_down))$AUC, 4))) +
  scale_x_continuous("1 - Specificity") + scale_y_continuous("Sensitivity")
which look like this: