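I used the caret package in R to compare different methods (PLS-DA, support vector machines, artificial neural networks, random forest) on the same dataset with stratified 10-fold cross-validation. The dataset has 1394 records. When comparing the results, I noticed that random forest has a higher area under the ROC curve than other models with similar sensitivity and specificity. Do models with similar sensitivity and specificity always have a similar area under the ROC curve?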
Below is the code for PLS-DA (ANN and linear SVM gave similar results) and for random forest:
PLS-DA
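library(caret) # provides createFolds, trainControl, train and twoClassSummary
Ycalib<-factor(file2[,1121],levels=c("1","0"),labels=c("pregnant","open")) # create the factor vector; "pregnant" is the first level, so twoClassSummary treats it as the event class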
names(Ycalib)<-c("y")
Xcalib<-data.frame(file2[,1126:1663]) # create the data frame with spectral data
set.seed(1001)
folds<-createFolds(Ycalib,k=10,list = TRUE, returnTrain = TRUE) # stratified folds for cross-validation
set.seed(1001)
ctrl<-trainControl(method="repeatedcv",index=folds,classProbs = TRUE,summaryFunction = twoClassSummary,savePredictions = TRUE)
set.seed(1001)
plsda<-train(x=Xcalib, # spectral data
  y=Ycalib, # factor vector
  method="pls", # pls-da algorithm
  tuneLength=60, # evaluate up to 60 components (ncomp)
  trControl=ctrl, # ctrl contains the cross-validation options
  preProc=c("center","scale"), # the data are centered and scaled
  metric="ROC") # select ncomp by area under the ROC curve (2 classes)
plsda
Random forest
Ycalib<-factor(file2[,1121],levels=c("1","0"),labels=c("pregnant","open")) # create the factor vector
names(Ycalib)<-c("y")
Xcalib<-data.frame(file2[,1126:1663]) # create the data frame with spectral data
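library(randomForest) # provides tuneRF and randomForest
mtry<-tuneRF(Xcalib, Ycalib, stepFactor=1) # OOB error at the default mtry, floor(sqrt(538)) = 23; with stepFactor=1 the search does not appear to move to other values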
mtry
set.seed(1001)
folds<-createFolds(Ycalib,k=10,list = TRUE, returnTrain = TRUE)
set.seed(1001)
ctrl<-trainControl(method="repeatedcv",index=folds,classProbs = TRUE,summaryFunction = twoClassSummary,savePredictions = TRUE)
customRF <- list(type = "Classification", library = "randomForest", loop = NULL) # custom caret model so that both mtry and ntree can be set through a grid in train() below
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {} # left empty: the grid is always supplied through tuneGrid
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  predict(modelFit, newdata) # predicted class
}
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  predict(modelFit, newdata, type = "prob") # class probabilities, used for the ROC metric
}
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes
customRF
grid <- expand.grid(mtry = 23, ntree = c(500, 1000) ) # set mtry according to the result of tuneRF above; ntree can also be changed here
set.seed(1001)
rdforest<-train(x=Xcalib, # spectral data
  y=Ycalib, # factor vector
  method=customRF, # random forest algorithm (customRF instead of 'rf' to be able to choose mtry and ntree using a grid)
  trControl=ctrl, # ctrl contains the cross-validation options
  preProc=c("center","scale"), # the data are centered and scaled
  metric="ROC", # metric is ROC for 2 classes. Accuracy is used for multiple classes
  tuneGrid = grid)
rdforest
Here are the results:
PLS-DA results
ncomp ROC Sens Spec
47 0.7382311 0.57758621 0.8119994
Random forest results
mtry ntree ROC Sens Spec
23 500 0.8434449 0.5896552 0.8158085
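To make the question concrete, here is a minimal base-R sketch on made-up scores (not the spectral data or the fitted models above). It assumes a 0.5 probability cutoff for the "pregnant" class, and the helper names `auc`, `sens` and `spec` are hypothetical, not caret functions. Sensitivity and specificity are read off at that single cutoff, while the area under the ROC curve depends on how all positive cases are ranked against all negative cases.

```r
# Synthetic illustration only: these scores are invented for the example.
# A case is predicted "pregnant" (positive) when its score is > cut.

# AUC as the fraction of (positive, negative) pairs in which the positive
# case has the higher score; ties count one half (Mann-Whitney form).
auc <- function(score, is_pos) {
  pos <- score[is_pos]
  neg <- score[!is_pos]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

# Sensitivity and specificity at one fixed cutoff.
sens <- function(score, is_pos, cut = 0.5) mean(score[is_pos] > cut)
spec <- function(score, is_pos, cut = 0.5) mean(score[!is_pos] <= cut)

# Four positive cases followed by four negative cases.
is_pos <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Model A: the missed positives score far below most negatives.
score_a <- c(0.90, 0.80, 0.35, 0.10, 0.60, 0.30, 0.40, 0.45)

# Model B: the missed positives sit just under the cutoff,
# still above most negatives.
score_b <- c(0.90, 0.80, 0.49, 0.48, 0.55, 0.10, 0.20, 0.30)

summary_tab <- rbind(
  A = c(Sens = sens(score_a, is_pos), Spec = spec(score_a, is_pos),
        ROC = auc(score_a, is_pos)),
  B = c(Sens = sens(score_b, is_pos), Spec = spec(score_b, is_pos),
        ROC = auc(score_b, is_pos))
)
print(summary_tab)
```

In this toy case both models have sensitivity 0.5 and specificity 0.75 at the cutoff, yet the areas are 0.5625 and 0.875. Whether something similar explains the PLS-DA and random forest numbers above is exactly what I am unsure about.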
PLS-DA cross-validated (10 fold, repeated 1 time) confusion matrix
          Reference
Prediction pregnant open
  pregnant     24.0 11.0
  open         17.6 47.4
Accuracy (average) : 0.7145
Random forest cross-validated (10 fold, repeated 1 time) confusion matrix
          Reference
Prediction pregnant open
  pregnant     25.7 10.0
  open         15.9 48.4
Accuracy (average) : 0.7403
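As a cross-check on the two tables, the sketch below recomputes sensitivity, specificity and accuracy from the cell values. It assumes the entries are percentages of all held-out samples averaged over the folds, with rows as predictions and "pregnant" as the positive class; the helper `rates` is a hypothetical name. Rates pooled this way need not match the fold-averaged Sens and Spec reported by `train` exactly.

```r
# Cell percentages copied from the two confusion matrices above.
# matrix() fills column by column: first the "pregnant" reference column,
# then the "open" reference column.
labs <- list(Prediction = c("pregnant", "open"),
             Reference  = c("pregnant", "open"))
cm_pls <- matrix(c(24.0, 17.6, 11.0, 47.4), nrow = 2, dimnames = labs)
cm_rf  <- matrix(c(25.7, 15.9, 10.0, 48.4), nrow = 2, dimnames = labs)

# Pooled rates with "pregnant" as the positive class.
rates <- function(cm) {
  c(Sens     = cm["pregnant", "pregnant"] / sum(cm[, "pregnant"]),
    Spec     = cm["open", "open"] / sum(cm[, "open"]),
    Accuracy = sum(diag(cm)) / sum(cm))
}

pooled <- rbind(PLSDA = rates(cm_pls), RF = rates(cm_rf))
print(round(pooled, 4))
```

For PLS-DA the pooled values (about 0.577 and 0.812) are close to the reported Sens and Spec. For random forest the pooled sensitivity comes out near 0.618, somewhat above the 0.590 in the ntree = 500 row, so the two summaries are not interchangeable.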