Question

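I am using caret for ML classification of a binary variable in a cohort of 100 patients. Because this variable is imbalanced (13/87 samples in the two classes), I am subsampling with SMOTE and ROSE.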

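The median ROC AUC across resamples for the different svmRadial models is 62.5% without subsampling, 76.4% with ROSE, and 77.8% with SMOTE (the means are 65.7%, 77.6%, and 78.5%). If I look at the accuracy of the held-out predictions after 3x repeated 10-fold CV, I get the best result without subsampling (87%), while SMOTE and ROSE perform much worse (71% and 39%).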

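Can someone explain to me why the higher ROC with SMOTE and ROSE translates into lower accuracy in the held-out predictions? In addition, I expected SMOTE and ROSE to change the number of samples and the class distribution in the held-out predictions as well, but when I look at my confusion matrices, the total number of samples is always n = 300 (without subsampling, but also with SMOTE and ROSE).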

Don't pay too much attention to the poor performance of the classifier (it only serves as an example to illustrate my question...).

Thanks for your help,

Philip

library(caret)

my_method <- "svmRadial"
# 3x repeated CV (number of folds defaults to 10); keep held-out predictions of the final model
ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = "final")
set.seed(1)
orig_fit <- train(Class ~ ., data = chosen_train,
                  method = my_method,
                  trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

# Same model, with ROSE subsampling applied inside resampling
ctrl$sampling <- "rose"
set.seed(1)
rose_inside <- train(Class ~ ., data = chosen_train,
                     method = my_method,
                     trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

# Same model, with SMOTE subsampling applied inside resampling
ctrl$sampling <- "smote"
set.seed(1)
smote_inside <- train(Class ~ ., data = chosen_train,
                      method = my_method,
                      trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

inside_models <- list(original = orig_fit, rose = rose_inside, smote = smote_inside)
set.seed(1)
inside_resampling <- resamples(inside_models)
> summary(inside_resampling, metric = "ROC")

           Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
original 0.4444  0.5556 0.6250 0.6569  0.7431    1    0
rose     0.3889  0.6667 0.7639 0.7757  0.8889    1    0
smote    0.4444  0.6667 0.7778 0.7845  0.8889    1    0


> confusionMatrix(rose_inside$pred$pred, rose_inside$pred$obs)

          Reference
Prediction MAIN OTHER
     MAIN    15   158
     OTHER   24   103

               Accuracy : 0.3933          
                 95% CI : (0.3377, 0.4511)
    No Information Rate : 0.87            
    P-Value [Acc > NIR] : 1               

                  Kappa : -0.0897         
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.38462         
            Specificity : 0.39464         
         Pos Pred Value : 0.08671         
         Neg Pred Value : 0.81102         
             Prevalence : 0.13000         
         Detection Rate : 0.05000         
   Detection Prevalence : 0.57667         
      Balanced Accuracy : 0.38963         

       'Positive' Class : MAIN        

> confusionMatrix(smote_inside$pred$pred,smote_inside$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
  MAIN        6    55
  OTHER      33   206

               Accuracy : 0.7067          
                 95% CI : (0.6516, 0.7576)
    No Information Rate : 0.87            
    P-Value [Acc > NIR] : 1.00000         

                  Kappa : -0.0459         
 Mcnemar's Test P-Value : 0.02518         

            Sensitivity : 0.15385         
            Specificity : 0.78927         
         Pos Pred Value : 0.09836         
         Neg Pred Value : 0.86192         
             Prevalence : 0.13000         
         Detection Rate : 0.02000         
   Detection Prevalence : 0.20333         
      Balanced Accuracy : 0.47156         

       'Positive' Class : MAIN        

> confusionMatrix(orig_fit$pred$pred,orig_fit$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
  MAIN        0     0
  OTHER      39   261

               Accuracy : 0.87            
                 95% CI : (0.8266, 0.9059)
    No Information Rate : 0.87            
    P-Value [Acc > NIR] : 0.5426          

                  Kappa : 0               
 Mcnemar's Test P-Value : 1.166e-09       

            Sensitivity : 0.00            
            Specificity : 1.00            
         Pos Pred Value :  NaN            
         Neg Pred Value : 0.87            
             Prevalence : 0.13            
         Detection Rate : 0.00            
   Detection Prevalence : 0.00            
      Balanced Accuracy : 0.50            

       'Positive' Class : MAIN

Answer 1

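Accuracy is not of much use here, because the 87% you get without subsampling is exactly the no-information rate for your problem (87/100).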
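To make that concrete, the figures can be recomputed by hand from the confusion matrices posted in the question. This is only a minimal base-R sketch: the counts are copied from the question, while the helper name `summarise_cm` and the list `cms` are made up for illustration and are not part of caret.

```r
# Counts copied from the confusion matrices in the question.
# Rows = prediction (MAIN, OTHER), columns = reference (MAIN, OTHER);
# matrix() fills column by column. "MAIN" is the positive class.
cms <- list(
  original = matrix(c(0, 39, 0, 261), nrow = 2),
  rose     = matrix(c(15, 24, 158, 103), nrow = 2),
  smote    = matrix(c(6, 33, 55, 206), nrow = 2)
)

# Hypothetical helper (not a caret function): basic metrics from a 2x2 table.
summarise_cm <- function(cm) {
  total <- sum(cm)
  accuracy <- sum(diag(cm)) / total
  nir <- max(colSums(cm)) / total    # no-information rate: share of the largest reference class
  sens <- cm[1, 1] / sum(cm[, 1])    # recall on MAIN
  spec <- cm[2, 2] / sum(cm[, 2])    # recall on OTHER
  c(n = total, accuracy = accuracy, nir = nir,
    balanced_accuracy = (sens + spec) / 2)
}

res <- round(t(sapply(cms, summarise_cm)), 4)
print(res)
```

Without subsampling, the model predicts OTHER for every sample, so its accuracy equals the no-information rate and its balanced accuracy is 0.5. Each matrix also sums to 300, which is consistent with 100 patients each being held out once in each of the 3 repeats; as far as I understand caret's `sampling` option, the subsampling is applied only to the training portion of each resample, so the held-out predictions keep the original 13/87 class distribution.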

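"The higher ROC with SMOTE and ROSE translates into lower accuracy in the held-out predictions" - I don't think this is a general or correct observation.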
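One reason the two numbers need not move together is that ROC AUC only measures how well the scores rank the classes, whereas accuracy depends on the cutoff applied to those scores. The toy example below is a sketch of that point, not a reconstruction of your models: the scores are invented, and `auc` and `accuracy` are small hand-written helpers (AUC via the rank-sum formula), not caret functions.

```r
# Toy data, invented for illustration: 2 positives followed by 8 negatives.
obs <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

# Model A: every score is below 0.5 and the positives are ranked in the middle.
score_a <- c(0.10, 0.20, 0.05, 0.15, 0.25, 0.30, 0.12, 0.08, 0.22, 0.18)
# Model B: the positives get the highest scores, but many negatives also exceed 0.5.
score_b <- c(0.90, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20)

# AUC from the Mann-Whitney rank-sum statistic (average ranks handle ties).
auc <- function(score, obs) {
  r <- rank(score)
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  (sum(r[obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy after turning scores into class labels at a fixed cutoff.
accuracy <- function(score, obs, cutoff = 0.5) {
  mean((score > cutoff) == (obs == 1))
}

cat("A: AUC =", auc(score_a, obs), " accuracy =", accuracy(score_a, obs), "\n")
cat("B: AUC =", auc(score_b, obs), " accuracy =", accuracy(score_b, obs), "\n")
```

Model B ranks the classes perfectly yet is less accurate at the 0.5 cutoff than model A, which simply predicts the majority class for everything. So a higher AUC alongside a lower accuracy is possible, but it is a property of the cutoff and the class balance rather than a general rule.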
