Similar sensitivity and specificity, but different area under the ROC curve: comparing methods with caret

Asked: 2019-02-20 00:30:22

Tags: r r-caret

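I used the caret package in R to compare different methods (PLS-DA, support vector machine, artificial neural network, random forest) on the same dataset with stratified 10-fold cross-validation. The dataset has 1394 records. When comparing the results, I noticed that random forest has a higher area under the ROC curve than the other models, even though its sensitivity and specificity are similar to theirs. Do models with similar sensitivity and specificity always have a similar area under the ROC curve?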
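
To make the question concrete, here is a minimal sketch on made-up scores (not the data from this post). It assumes that sensitivity and specificity are computed from class predictions at a single cutoff (0.5 here), while the ROC area depends on how all positives rank against all negatives. The vectors `score_a` and `score_b` and both helper functions are hypothetical and written in base R only.

```r
# Ten positives followed by ten negatives (hypothetical labels)
is_pos <- rep(c(TRUE, FALSE), each = 10)

# Two hypothetical models that put every case on the same side of 0.5,
# but rank the cases very differently
score_a <- c(rep(0.55, 6), rep(0.10, 4), rep(0.90, 2), rep(0.40, 8))
score_b <- c(rep(0.90, 6), rep(0.45, 4), rep(0.55, 2), rep(0.10, 8))

# Sensitivity and specificity at one fixed cutoff
sens_spec <- function(scores, is_pos, cutoff = 0.5) {
  c(Sens = mean(scores[is_pos] > cutoff),
    Spec = mean(scores[!is_pos] <= cutoff))
}

# AUC as the probability that a positive outscores a negative (ties count 0.5)
auc <- function(scores, is_pos) {
  cmp <- outer(scores[is_pos], scores[!is_pos], "-")
  mean((cmp > 0) + 0.5 * (cmp == 0))
}

rbind(model_a = c(sens_spec(score_a, is_pos), AUC = auc(score_a, is_pos)),
      model_b = c(sens_spec(score_b, is_pos), AUC = auc(score_b, is_pos)))
```

In this toy case both models have Sens = 0.6 and Spec = 0.8 at the 0.5 cutoff, yet the AUC is 0.48 for `score_a` and 0.92 for `score_b`, so identical sensitivity and specificity at one cutoff do not by themselves pin down the area.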

Here is the code for PLS-DA (the ANN and linear SVM gave similar results) and for random forest:

  • PLS-DA

    library(caret) # provides createFolds, trainControl, train, twoClassSummary
    
    Ycalib<-factor(file2[,1121],levels=c("1","0"),labels=c("pregnant","open")) # create the factor vector 
    names(Ycalib)<-c("y")
    Xcalib<-data.frame(file2[,1126:1663]) # create the data frame with spectral data
    
    set.seed(1001) 
    folds<-createFolds(Ycalib,k=10,list = TRUE, returnTrain = TRUE)  # stratified folds for cross-validation 
    
    set.seed(1001) 
    ctrl<-trainControl(method="repeatedcv",index=folds,classProbs = TRUE,summaryFunction = twoClassSummary,savePredictions = TRUE) 
    
    set.seed(1001)
    plsda<-train(x=Xcalib, # spectral data
                  y=Ycalib, # factor vector
                  method="pls", # pls-da algorithm
                  tuneLength=60, # evaluate up to 60 components
                  trControl=ctrl, # ctrl holds the cross-validation options
                  preProc=c("center","scale"), # the data are centered and scaled
                  metric="ROC") # metric is ROC for 2 classes
    plsda
    
  • Random forest

    library(caret)
    library(randomForest) # provides tuneRF and randomForest
    
    Ycalib<-factor(file2[,1121],levels=c("1","0"),labels=c("pregnant","open")) # create the factor vector 
    names(Ycalib)<-c("y")
    Xcalib<-data.frame(file2[,1126:1663]) # create the data frame with spectral data
    
    mtry<-tuneRF(Xcalib, Ycalib, stepFactor=1) # look for a good value of mtry (with stepFactor=1 the search likely stays at the default mtry)
    mtry
    
    
    set.seed(1001)
    folds<-createFolds(Ycalib,k=10,list = TRUE, returnTrain = TRUE) 
    
    set.seed(1001)
    ctrl<-trainControl(method="repeatedcv",index=folds,classProbs = TRUE,summaryFunction = twoClassSummary,savePredictions = TRUE)
    
    customRF <- list(type = "Classification", library = "randomForest", loop = NULL) # custom model so that both mtry and ntree can be chosen via a grid in train() below
    customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
    customRF$grid <- function(x, y, len = NULL, search = "grid") {}
    customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
      randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
    }
    customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
      predict(modelFit, newdata)
    customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
      predict(modelFit, newdata, type = "prob")
    customRF$sort <- function(x) x[order(x[,1]),]
    customRF$levels <- function(x) x$classes
    customRF
    
    grid <- expand.grid(mtry = 23, ntree = c(500, 1000) ) # set mtry according to the result of tuneRF above; ntree can also be varied
    
    set.seed(1001)
    rdforest<-train(x=Xcalib, # spectral data
              y=Ycalib, # factor vector
              method=customRF, # random forest algorithm (customRF instead of 'rf' to be able to choose mtry and ntree using a grid)
              trControl=ctrl, # ctrl holds the cross-validation options
              preProc=c("center","scale"), # the data are centered and scaled
              metric="ROC", # metric is ROC for 2 classes. Accuracy is used for multiple classes
              tuneGrid = grid) 
    rdforest
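
Since both calls use `savePredictions = TRUE`, the held-out class probabilities should be available in the `pred` element of each fitted object (`plsda$pred`, `rdforest$pred`), and the ROC value caret reports is the average over the held-out folds. The sketch below recomputes a per-fold AUC from a data frame shaped like that element. It is an illustration only: `fold_auc` and `toy_pred` are hypothetical names, the toy values are invented, and only the columns `obs`, `pregnant` and `Resample` are assumed.

```r
# Per-fold AUC from a data frame with columns:
#   obs      - observed class
#   pregnant - predicted probability of the positive class
#   Resample - fold label
fold_auc <- function(pred, positive = "pregnant") {
  auc_one <- function(d) {
    pos <- d[[positive]][d$obs == positive]
    neg <- d[[positive]][d$obs != positive]
    cmp <- outer(pos, neg, "-")
    # share of positive/negative pairs ranked correctly, ties count 0.5
    mean((cmp > 0) + 0.5 * (cmp == 0))
  }
  sapply(split(pred, pred$Resample), auc_one)
}

# Invented stand-in for something like plsda$pred (two folds, four rows each)
toy_pred <- data.frame(
  obs      = factor(c("pregnant", "pregnant", "open", "open",
                      "pregnant", "pregnant", "open", "open"),
                    levels = c("pregnant", "open")),
  pregnant = c(0.9, 0.4, 0.6, 0.1, 0.8, 0.7, 0.3, 0.2),
  Resample = rep(c("Fold01", "Fold02"), each = 4)
)

fold_auc(toy_pred)        # AUC per fold
mean(fold_auc(toy_pred))  # fold-averaged AUC
```

On the real objects, `pred` holds rows for every tuning combination, so it would first need to be restricted to the rows matching `bestTune` (for example `ncomp == 47` for the PLS-DA fit) before comparing the two models fold by fold.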
    

Here are the results:

  • PLS-DA results

    ncomp  ROC        Sens        Spec     
    47     0.7382311  0.57758621  0.8119994
    
  • Random forest results

    mtry  ntree  ROC        Sens       Spec     
    23    500   0.8434449  0.5896552  0.8158085
    
  • PLS-DA cross-validated (10-fold, repeated 1 time) confusion matrix

                        Reference
          Prediction pregnant open
            pregnant     24.0 11.0
            open         17.6 47.4
    
          Accuracy (average) : 0.7145
    
  • Random forest cross-validated (10-fold, repeated 1 time) confusion matrix

                    Reference
          Prediction pregnant open
            pregnant     25.7 10.0
            open         15.9 48.4
    
           Accuracy (average) : 0.7403
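
As a cross-check, sensitivity, specificity and accuracy can be recomputed from the two confusion matrices above. This is a small sketch using only the printed cell values, which are read here as percentages of all held-out samples; `cm_stats` is a hypothetical helper, and the pooled figures it returns need not match the fold-averaged Sens and Spec in the results tables exactly.

```r
# Cell values copied from the confusion matrices above
# (rows = Prediction, columns = Reference; matrix() fills column by column)
labs <- list(Prediction = c("pregnant", "open"),
             Reference  = c("pregnant", "open"))
cm_plsda <- matrix(c(24.0, 17.6, 11.0, 47.4), nrow = 2, dimnames = labs)
cm_rf    <- matrix(c(25.7, 15.9, 10.0, 48.4), nrow = 2, dimnames = labs)

# Pooled sensitivity, specificity and accuracy from one 2x2 table
cm_stats <- function(cm, positive = "pregnant") {
  negative <- setdiff(colnames(cm), positive)
  c(Sens     = cm[positive, positive] / sum(cm[, positive]),
    Spec     = cm[negative, negative] / sum(cm[, negative]),
    Accuracy = sum(diag(cm)) / sum(cm))
}

stats <- round(rbind(PLSDA = cm_stats(cm_plsda), RF = cm_stats(cm_rf)), 4)
stats
```

With these cells the pooled values come out at roughly Sens 0.577, Spec 0.812, Accuracy 0.714 for PLS-DA and Sens 0.618, Spec 0.829, Accuracy 0.741 for random forest. All of these describe a single operating point, so they say nothing about how the two models behave at other cutoffs.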
    

0 answers:

No answers