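I am using caret for ML classification of a binary variable in a cohort of 100 patients. Because this variable is imbalanced (13 vs. 87 samples in the two classes), I use SMOTE and ROSE for subsampling.
The average ROC (the Median column of the summary below) of the different classification models using svmRadial is: 62.5% without subsampling, 76.4% with ROSE and 77.8% with SMOTE. If I look at the accuracy of the held-out predictions after 3 times repeated 10-fold CV, I get the best result without subsampling (87%), while SMOTE and ROSE perform much worse (71% and 39%).
Can someone explain to me why the higher ROC of SMOTE and ROSE translates into lower accuracy in the held-out predictions? Moreover, I had expected SMOTE and ROSE to change the number of samples and the class distribution in the held-out predictions as well, but when I look at my confusion matrices, the total number of samples is always n = 300 (without subsampling, but also with SMOTE and ROSE).
Don't pay too much attention to the poor accuracy of the classifier (it only serves as an example to illustrate my question...).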
Thanks for your help,
Philipp
    library(caret)

    my_method <- "svmRadial"

    # 3x repeated CV (caret defaults to 10 folds); keep the held-out
    # predictions for the final tuning parameters
    ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE,
                         summaryFunction = twoClassSummary, savePredictions = "final")

    # No subsampling
    set.seed(1)
    orig_fit <- train(Class ~ ., data = chosen_train,
                      method = my_method,
                      trControl = ctrl, metric = "ROC",
                      preProc = c("center", "scale"), verbose = FALSE)

    # ROSE applied inside resampling
    ctrl$sampling <- "rose"
    set.seed(1)
    rose_inside <- train(Class ~ ., data = chosen_train,
                         method = my_method,
                         trControl = ctrl, metric = "ROC",
                         preProc = c("center", "scale"), verbose = FALSE)

    # SMOTE applied inside resampling
    ctrl$sampling <- "smote"
    set.seed(1)
    smote_inside <- train(Class ~ ., data = chosen_train,
                          method = my_method,
                          trControl = ctrl, metric = "ROC",
                          preProc = c("center", "scale"), verbose = FALSE)

    # Collect the resampling results of all three models
    inside_models <- list(original = orig_fit, rose = rose_inside, smote = smote_inside)
    set.seed(1)
    inside_resampling <- resamples(inside_models)
    > summary(inside_resampling, metric = "ROC")
               Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
    original 0.4444  0.5556 0.6250 0.6569  0.7431    1    0
    rose     0.3889  0.6667 0.7639 0.7757  0.8889    1    0
    smote    0.4444  0.6667 0.7778 0.7845  0.8889    1    0
    > confusionMatrix(rose_inside$pred$pred, rose_inside$pred$obs)
              Reference
    Prediction MAIN OTHER
         MAIN    15   158
         OTHER   24   103

                   Accuracy : 0.3933
                     95% CI : (0.3377, 0.4511)
        No Information Rate : 0.87
        P-Value [Acc > NIR] : 1
                      Kappa : -0.0897
     Mcnemar's Test P-Value : <2e-16
                Sensitivity : 0.38462
                Specificity : 0.39464
             Pos Pred Value : 0.08671
             Neg Pred Value : 0.81102
                 Prevalence : 0.13000
             Detection Rate : 0.05000
       Detection Prevalence : 0.57667
          Balanced Accuracy : 0.38963
           'Positive' Class : MAIN
    > confusionMatrix(smote_inside$pred$pred, smote_inside$pred$obs)
    Confusion Matrix and Statistics
              Reference
    Prediction MAIN OTHER
         MAIN     6    55
         OTHER   33   206

                   Accuracy : 0.7067
                     95% CI : (0.6516, 0.7576)
        No Information Rate : 0.87
        P-Value [Acc > NIR] : 1.00000
                      Kappa : -0.0459
     Mcnemar's Test P-Value : 0.02518
                Sensitivity : 0.15385
                Specificity : 0.78927
             Pos Pred Value : 0.09836
             Neg Pred Value : 0.86192
                 Prevalence : 0.13000
             Detection Rate : 0.02000
       Detection Prevalence : 0.20333
          Balanced Accuracy : 0.47156
           'Positive' Class : MAIN
    > confusionMatrix(orig_fit$pred$pred, orig_fit$pred$obs)
    Confusion Matrix and Statistics
              Reference
    Prediction MAIN OTHER
         MAIN     0     0
         OTHER   39   261

                   Accuracy : 0.87
                     95% CI : (0.8266, 0.9059)
        No Information Rate : 0.87
        P-Value [Acc > NIR] : 0.5426
                      Kappa : 0
     Mcnemar's Test P-Value : 1.166e-09
                Sensitivity : 0.00
                Specificity : 1.00
             Pos Pred Value : NaN
             Neg Pred Value : 0.87
                 Prevalence : 0.13
             Detection Rate : 0.00
       Detection Prevalence : 0.00
          Balanced Accuracy : 0.50
           'Positive' Class : MAIN
Answer 0 (score: 1)
The accuracy here is not great, since it is the same as the no-information rate for your problem (87/100).
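As a rough check of that point, the counts from the unsampled model's confusion matrix in the question can be recomputed by hand in base R. This is only a sketch: the matrix is typed in from the posted output, and the variable names are mine, not part of caret.

```r
# Counts copied from the posted confusion matrix of the model without
# subsampling (rows = predicted class, columns = observed class).
cm <- matrix(c(0, 39, 0, 261), nrow = 2,
             dimnames = list(Prediction = c("MAIN", "OTHER"),
                             Reference  = c("MAIN", "OTHER")))

# Accuracy: share of held-out predictions on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the largest observed class,
# i.e. what "always predict OTHER" would achieve.
nir <- max(colSums(cm)) / sum(cm)

# Per-class rates, with MAIN as the positive class.
sensitivity <- cm["MAIN", "MAIN"] / sum(cm[, "MAIN"])
specificity <- cm["OTHER", "OTHER"] / sum(cm[, "OTHER"])
balanced_accuracy <- (sensitivity + specificity) / 2

print(c(accuracy = accuracy, nir = nir,
        sensitivity = sensitivity, specificity = specificity,
        balanced_accuracy = balanced_accuracy))
```

Accuracy and the no-information rate both come out at 0.87, while sensitivity is 0 and balanced accuracy is 0.5: the model never predicts MAIN, so the 87% reflects only the class ratio.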
"The higher ROC of SMOTE and ROSE translates into lower accuracy in the held-out predictions": I don't think this is a general or correct observation.
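One reason the two metrics need not move together is that ROC AUC depends only on how the predicted probabilities rank the samples, whereas accuracy also depends on where those probabilities fall relative to the cutoff. The toy example below uses made-up scores, not the data from the question, and hypothetical helper functions `auc` and `accuracy_at` written in base R. It is meant as an illustration, not as a diagnosis of the results above.

```r
# Hypothetical data: 2 MAIN (positive) and 8 OTHER samples.
is_main <- c(TRUE, TRUE, rep(FALSE, 8))

# Two made-up sets of predicted probabilities for MAIN with the same ranking;
# the second is simply shifted upward by 0.5.
p_low  <- c(0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02)
p_high <- p_low + 0.5

# AUC via the rank-sum (Mann-Whitney) formula; it uses only the ranking.
auc <- function(p, is_pos) {
  r <- rank(p)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy when predicting MAIN for probabilities above the cutoff.
accuracy_at <- function(p, is_pos, cutoff = 0.5) {
  mean((p > cutoff) == is_pos)
}

auc(p_low, is_main)          # 1: both MAIN samples rank above every OTHER
auc(p_high, is_main)         # 1: same ranking, same AUC
accuracy_at(p_low, is_main)  # 0.8: everything is predicted OTHER
accuracy_at(p_high, is_main) # 0.2: everything is predicted MAIN
```

Both score sets have the same AUC, yet their accuracy at the 0.5 cutoff differs widely, so a higher AUC does not imply anything specific about accuracy at a fixed cutoff.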