What is the true positive rate when the model's false positive rate is 0.5?

Time: 2019-04-07 08:18:14

Tags: r roc

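I am trying to understand how to calculate the true positive rate when a model's FPR is 0.5, and then produce a ROC curve. But I am clearly running into some problems with the coding...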

> library(nycflights13)
> late_arrival<- flights$arr_delay>50
> summary(late_arrival)
   Mode   FALSE    TRUE    NA's 
logical  275847   51499    9430 
> late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month+year, data=flights, family='binomial')

Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred

> summary(late_arrival.lr)
Call:
glm(formula = late_arrival ~ carrier + dep_delay + month + year, 
    family = "binomial", data = flights)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0972  -0.2445  -0.1920  -0.1570   3.9217  

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.9122786  0.0430834 -90.807  < 2e-16 ***
carrierAA    0.2174443  0.0485813   4.476 7.61e-06 ***
carrierAS   -0.3549507  0.2540636  -1.397  0.16239    
carrierB6    0.5142442  0.0428985  11.987  < 2e-16 ***
carrierDL    0.2228855  0.0449833   4.955 7.24e-07 ***
carrierEV    0.3230899  0.0431394   7.489 6.92e-14 ***
carrierF9    1.1544420  0.1444764   7.991 1.34e-15 ***
carrierFL    0.7190162  0.0812251   8.852  < 2e-16 ***
carrierHA   -0.2276957  0.4115495  -0.553  0.58008    
carrierMQ    0.8086500  0.0475393  17.010  < 2e-16 ***
carrierOO    1.0138755  0.9037621   1.122  0.26193    
carrierUA    0.0919203  0.0431571   2.130  0.03318 *  
carrierUS    0.6063731  0.0525429  11.541  < 2e-16 ***
carrierVX   -0.0485832  0.0852892  -0.570  0.56893    
carrierWN   -0.1551747  0.0574042  -2.703  0.00687 ** 
carrierYV    0.5737826  0.1999578   2.870  0.00411 ** 
dep_delay    0.1000536  0.0004308 232.263  < 2e-16 ***
month        0.0009126  0.0024337   0.375  0.70767    
year                NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284924  on 327345  degrees of freedom
Residual deviance: 108708  on 327328  degrees of freedom
AIC: 108744

Number of Fisher Scoring iterations: 7

It keeps showing me this warning: (Dispersion parameter for binomial family taken to be 1)

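How do I actually go from here to making predictions? I know I somehow have to obtain the predicted and actual values in order to get the true positive rate. Can someone guide me? Thanks a lot!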

1 answer:

Answer 0 (score: 0)

Remove year from the model because it does not vary, refit the model, and then pass flights as the newdata argument to the model's predict() method.

Example, using the terminology and abbreviations from the Wikipedia page on ROC:

library(nycflights13)

late_arrival<- flights$arr_delay>50
late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month, data=flights, family='binomial')
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit <- predict(late_arrival.lr, newdata = flights, type = "response")

d <- data.frame(late_arrival, fit)

# "Confusion matrix" of actual vs predicted outcomes
# for a cutpoint of 0.5:
xtabs(~ late_arrival + I(fit > 0.5), data = d)
#>             I(fit > 0.5)
#> late_arrival  FALSE   TRUE
#>        FALSE 290637   3091
#>        TRUE    7386  26232

# Now do this for a range of cutpoints.
# Sensitivity = true positive rate = TPR
# Specificity = true negative rate = TNR
# 1 - Specificity = false positive rate = FPR = 1 - TNR
# The ROC plot is
#     x = 1 - Specificity = FPR
#     y = Sensitivity     = TPR

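fun <- function(cutpoint) {
    # Fix both factor levels so the table stays 2 x 2 even when every
    # prediction falls on one side of the cutpoint
    actual <- factor(d$late_arrival, levels = c(FALSE, TRUE))
    pred <- factor(d$fit > cutpoint, levels = c(FALSE, TRUE))
    # cm = "confusion matrix" (rows = actual, columns = predicted);
    # table() drops rows with NA by default
    cm <- table(actual, pred)
    # as.list() walks the table column by column
    cm <- as.list(cm)
    names(cm) <- c("TN", "FN", "FP", "TP")
    sens <- with(cm, TP / (TP + FN))
    spec <- with(cm, TN / (TN + FP))
    return(data.frame(cutpoint, sens, spec))
}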

# Example output:
fun(0.5)
#>   cutpoint      sens      spec
#> 1      0.5 0.7802963 0.9894767

cutpoints <- seq(0.02, 0.98, by = 0.02)
# This does
# rbind(fun(cutpoints[1]), fun(cutpoints[2]), ...)
roc <- do.call(rbind, lapply(cutpoints, fun))
plot(1 - roc$spec, roc$sens, type = "b",
     xlab = "False positive rate (1 - specificity)", 
     ylab = "True positive rate (sensitivity)",
     xlim = c(0, 1),
     ylim = c(0, 1))

Created on 2019-04-07 by the reprex package (v0.2.1.9000)
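
To read the true positive rate at a false positive rate of 0.5 off the ROC curve, one option is to interpolate between the ROC points on either side of FPR = 0.5. The following is a minimal sketch under stated assumptions, not the only way to do it: it uses simulated data so that it runs without nycflights13, and the names actual, tpr, fpr and tpr_at_half are made up for illustration. With the flights model, the complete cases of d$late_arrival and d$fit would take the place of actual and fit.

```r
# Simulated stand-in data (an assumption, not the flights data):
# 'actual' is the observed outcome, 'fit' the predicted probability
set.seed(1)
n <- 2000
x <- rnorm(n)
actual <- rbinom(n, 1, plogis(-1 + 2 * x)) == 1
fit <- predict(glm(actual ~ x, family = "binomial"), type = "response")

# Empirical ROC: sort by decreasing fitted value, so that each
# observation in turn acts as the cutpoint
ord <- order(fit, decreasing = TRUE)
tpr <- c(0, cumsum(actual[ord]) / sum(actual))
fpr <- c(0, cumsum(!actual[ord]) / sum(!actual))

# Several ROC points share the same FPR (vertical steps);
# ties = max keeps the top of each step before interpolating
tpr_at_half <- approx(fpr, tpr, xout = 0.5, ties = max)$y
tpr_at_half
```

Using every fitted value as a cutpoint avoids depending on a fixed grid such as seq(0.02, 0.98, by = 0.02), which may not contain a point close to FPR = 0.5.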

Note that a few issues need to be addressed before answering your main question:

In your example, the effect of year is estimated as NA because this variable does not vary, so its effect cannot be estimated.

> unique(flights$year)
[1] 2013
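
The same check can be run over all candidate predictors at once before fitting. This is a small sketch on a toy data frame; df, n_distinct and constant_cols are hypothetical names, and df only stands in for the relevant columns of flights.

```r
# Toy data frame (hypothetical stand-in for a few columns of flights)
df <- data.frame(
  year      = rep(2013, 6),
  month     = c(1, 1, 2, 2, 3, 3),
  dep_delay = c(-2, 10, 45, 0, 120, 7)
)

# Count the distinct values in each column
n_distinct <- vapply(df, function(col) length(unique(col)), integer(1))

# Columns with a single distinct value cannot have an estimable effect
constant_cols <- names(n_distinct)[n_distinct < 2]
constant_cols
```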

If you drop this predictor and refit, the output makes sense (meaning there are no NAs or huge standard errors):

> late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month, data=flights, family='binomial')
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
> coef(summary(late_arrival.lr))
                Estimate   Std. Error     z value     Pr(>|z|)
(Intercept) -5.325540101 0.0564526220 -94.3364527 0.000000e+00
carrierAA    0.335139676 0.0622536491   5.3834543 7.306979e-08
carrierAS   -0.980666348 0.3701250164  -2.6495544 8.059801e-03
carrierB6    0.524971196 0.0542918253   9.6694335 4.066226e-22
carrierDL    0.406813418 0.0576767561   7.0533339 1.746810e-12
carrierEV    0.350366432 0.0535144496   6.5471370 5.865056e-11
carrierF9    0.776012126 0.2084826127   3.7221911 1.975015e-04
carrierFL    0.773647203 0.1077982499   7.1768067 7.135846e-13
carrierHA   -2.225896541 0.8684691013  -2.5630118 1.037685e-02
carrierMQ    0.847415433 0.0601677914  14.0842037 4.749822e-45
carrierOO    0.232324503 1.3043323784   0.1781176 8.586307e-01
carrierUA    0.157191477 0.0549977051   2.8581461 4.261241e-03
carrierUS    0.649304471 0.0697493204   9.3091154 1.289014e-20
carrierVX    0.237994726 0.1131585684   2.1031967 3.544858e-02
carrierWN    0.032542799 0.0736491439   0.4418626 6.585887e-01
carrierYV    0.861814625 0.2373042135   3.6316870 2.815745e-04
dep_delay    0.089655081 0.0004428296 202.4595603 0.000000e+00
month        0.005089147 0.0032449949   1.5683066 1.168096e-01

The warning "fitted probabilities numerically 0 or 1 occurred" usually means that one of your continuous predictors perfectly predicts the outcome. For example:

> x <- c(1, 2, 3)
> y <- c(0, 0, 1)
> coef(summary(glm(y ~ x, family="binomial")))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
              Estimate Std. Error       z value  Pr(>|z|)
(Intercept) -115.57626  226884.08 -0.0005094067 0.9995936
x             46.34447   94156.73  0.0004922056 0.9996073

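Here, the best estimate is

P(y = 1) = 0 if x < threshold, otherwise 1

but this causes two numerical problems:

  • The usual sigmoid curve of P(y = 1) against x should now be a step function. That requires an infinitely steep sigmoid, so the "slope" with respect to x tends to infinity.
  • Any threshold between 2 and 3 works equally well, so no single best estimate of the intercept can be determined.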
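
Both problems can be seen directly in the three-point example. The sketch below is only illustrative; the names m, p, max_gap, thresholds and n_errors are introduced here and are not part of the original code.

```r
x <- c(1, 2, 3)
y <- c(0, 0, 1)

# Problem 1: the fit chases a step function, so the fitted
# probabilities end up numerically equal to the observed 0/1 values
m <- suppressWarnings(glm(y ~ x, family = "binomial"))
p <- fitted(m)
max_gap <- max(abs(p - y))

# Problem 2: every threshold strictly between 2 and 3 classifies
# all three points correctly, so none of them is "the best"
thresholds <- c(2.1, 2.5, 2.9)
n_errors <- sapply(thresholds, function(t) sum((x > t) != (y == 1)))
n_errors
```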

However, in the flights case, I think the warning means only what it says: some predictions are so certain that rounding error loses any nuance.
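
As a rough illustration of that rounding effect, the inverse logit of a large linear predictor is exactly 1 in double precision. The sketch below rounds the intercept and dep_delay coefficient from the refitted model, ignores the carrier and month terms, and uses a hypothetical 600-minute delay, so it is an approximation for illustration only.

```r
# Rounded coefficients from the refitted model (carrier and month
# terms are ignored here: an illustrative simplification)
intercept <- -5.3255
slope     <-  0.08966
delay     <- 600   # hypothetical departure delay in minutes

eta <- intercept + slope * delay   # linear predictor (log-odds)
p   <- plogis(eta)                 # inverse logit

# exp(-eta) is far below machine precision relative to 1,
# so p cannot be distinguished from 1
rounds_to_one <- (p == 1)
rounds_to_one
```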



To check whether late_arrival really can be perfectly predicted by a single x variable, I used the following code:

# Make warnings print as they appear.
# options() returns the previous settings, and we store it
warn <- options(warn = 1)$warn
for (i in c("carrier", "dep_delay", "month", "year")) {
  print(i)
  glm(late_arrival~flights[[i]], family='binomial')
}
# Restore the previous warning setting
options(warn = warn)

which prints

[1] "carrier"
[1] "dep_delay"
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
[1] "month"
[1] "year"

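But plot(flights$dep_delay, late_arrival) (which takes a few seconds) shows that there is in fact no perfect separation, that is, no threshold of dep_delay above which all of the late_arrival cases occur.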
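
The visual check can also be done numerically by comparing the ranges of the predictor in the two outcome groups. This is a minimal sketch on toy vectors; delay, late, separated and n_overlap are hypothetical names standing in for flights$dep_delay and late_arrival.

```r
# Toy data (hypothetical): delays in minutes and a late/not-late flag
delay <- c(-5, 0, 12, 30, 45, 60, 90, 200)
late  <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Perfect separation needs every late case to sit above every
# non-late case
separated <- max(delay[!late]) < min(delay[late])

# Cases inside the overlap region are what rule out a clean threshold
overlap <- delay >= min(delay[late]) & delay <= max(delay[!late])
n_overlap <- sum(overlap)

separated
n_overlap
```

With the real data, rows where either variable is NA would have to be dropped first, since max() and min() return NA otherwise.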