How does caret determine the probability threshold to maximize specificity?

Time: 2017-08-18 05:22:23

Tags: r r-caret

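I am using caret's twoClassSummary function to find the model hyperparameters that maximize specificity. However, how does the function determine the probability threshold that maximizes specificity?

Basically, for each model parameter/hyperparameter/fold, does it evaluate every threshold between 0 and 1 and return the maximum specificity? In the example below, you can see that the model settled on cp = 0.01492537.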

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=twoClassSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)


CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results across tuning parameters:

  cp          ROC        Sens   Spec     
  0.01305970  0.7615943  0.824  0.5937806
  0.01492537  0.7712055  0.824  0.6016073
  0.01741294  0.7544469  0.830  0.5976939
  0.10447761  0.6915783  0.866  0.5035639
  0.24253731  0.6437820  0.884  0.4035639

Spec was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.01492537.

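1 answer:

Answer 0 (score: 2)

No, twoClassSummary does not evaluate every threshold between 0 and 1. It only returns the values for the standard threshold of 0.5.

twoClassSummary is defined as: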

 function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
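The definition above never touches a threshold: Sens and Spec are computed from the class labels in `data[, "pred"]`, and only the ROC value uses the probabilities. A minimal base-R sketch of that idea follows. The toy data are made up, and `spec_at` is a hypothetical helper written for this illustration (not a caret function) that follows the usual definition of specificity.

```r
# Toy resampling output in the shape a summary function receives:
# observed class, predicted class, and one probability column per class.
# All values below are made up for illustration.
lev <- c("neg", "pos")
toy <- data.frame(
  obs = factor(c("neg", "neg", "pos", "pos", "pos", "neg"), levels = lev),
  neg = c(0.90, 0.40, 0.30, 0.60, 0.10, 0.70)
)
toy$pos <- 1 - toy$neg

# Hard class labels: pick the class with the larger probability.
toy$pred <- factor(ifelse(toy$neg > toy$pos, "neg", "pos"), levels = lev)

# Hypothetical helper: share of truly "negative" cases predicted as such.
spec_at <- function(pred, obs, negative) {
  sum(pred == negative & obs == negative) / sum(obs == negative)
}

# Specificity from the hard labels, with "pos" as the negative class,
# mirroring specificity(pred, obs, lev[2]) in twoClassSummary.
spec_default <- spec_at(toy$pred, toy$obs, "pos")

# The same labels reproduced with an explicit 0.5 cutoff on P(neg).
pred_05 <- factor(ifelse(toy$neg > 0.5, "neg", "pos"), levels = lev)
spec_05 <- spec_at(pred_05, toy$obs, "pos")

print(c(default = spec_default, cutoff_0.5 = spec_05))
```

For a two-class problem, picking the more probable class and cutting P(neg) at 0.5 give the same labels, so both numbers agree.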

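To verify this, try the following example with a custom summaryFunction in which I explicitly set the threshold to 0.5. You will see that the two values Spec (the original specificity, as reported by twoClassSummary) and Spec2 (the specificity with the threshold manually set to 0.5) are exactly the same: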

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  out <- c(spec, spec2)

  names(out) <- c("Spec", "Spec2")
  out
}

# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=customSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength= 5,
             metric="Spec", 
             trControl=control)

print(fit)
CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 615, 615, 614, 614, 614 
Resampling results across tuning parameters:

  cp          Spec       Spec2    
  0.01305970  0.5749825  0.5749825
  0.01492537  0.5411600  0.5411600
  0.01741294  0.5596785  0.5596785
  0.10447761  0.4932215  0.4932215
  0.24253731  0.2837177  0.2837177

Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0130597.

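In addition, if you want to compute, for each hyperparameter setting, the maximum specificity over a range of thresholds and report that value, you can define a custom summaryFunction like the one below, which tries every threshold from 0.1 to 0.95 in steps of 0.05: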

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # specificity from the class predictions supplied by caret
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # specificity with the threshold set to 0.5 by hand
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"), levels = lev)
  spec2 <- specificity(pred, data[, "obs"], "pos")
  # specificity at each threshold from 0.1 to 0.95 in steps of 0.05
  speclist <- numeric(0)
  for(i in seq(0.1, 0.95, 0.05)){
    # fix the levels so a single-class prediction still carries both levels
    predi <- factor(ifelse(data[, "neg"] > i, "neg", "pos"), levels = lev)
    singlespec <- specificity(predi, data[, "obs"], "pos")
    speclist <- c(speclist, singlespec)
  }
  specmax <- max(speclist)

  out <- c(spec, spec2, specmax)

  names(out) <- c("Spec", "Spec2", "Specmax")
  out
}
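One caveat about a threshold sweep like this: raising the cutoff on P(neg) can only move predictions from "neg" to "pos", so specificity (with "pos" as the negative class) never decreases, and its maximum is reached at the largest cutoff tried. The base-R sketch below illustrates this on made-up probabilities. `rates_at` is a hypothetical helper for this illustration, not part of caret.

```r
# Made-up class probabilities and outcomes, for illustration only.
lev <- c("neg", "pos")
obs <- factor(c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "neg"),
              levels = lev)
p_neg <- c(0.92, 0.81, 0.66, 0.37, 0.58, 0.22, 0.12, 0.97)

# Hypothetical helper: sensitivity ("neg" as the event) and specificity
# ("pos" as the non-event) at a single cutoff on P(neg).
rates_at <- function(cutoff) {
  pred <- factor(ifelse(p_neg > cutoff, "neg", "pos"), levels = lev)
  c(cutoff = cutoff,
    Sens = sum(pred == "neg" & obs == "neg") / sum(obs == "neg"),
    Spec = sum(pred == "pos" & obs == "pos") / sum(obs == "pos"))
}

# Evaluate the same grid of cutoffs used in the summary function above.
cutoffs <- seq(0.1, 0.95, 0.05)
by_cutoff <- as.data.frame(t(sapply(cutoffs, rates_at)))
print(by_cutoff)

# Row with the highest specificity (first one, if there are ties).
best <- by_cutoff[which.max(by_cutoff$Spec), ]
print(best)
```

Because of this, it is probably more informative to look at Sens alongside Spec at each cutoff than to report the maximum specificity alone. To have train select on the new value, metric would need to be set to "Specmax" so that it matches a name returned by the summary function.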