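I am trying to understand how to compute the true positive rate when the FPR is 0.5 for my model, and then generate a ROC curve. But I am clearly running into some problems with the coding...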
> library(nycflights13)
> late_arrival<- flights$arr_delay>50
> summary(late_arrival)
Mode FALSE TRUE NA's
logical 275847 51499 9430
> late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month+year, data=flights, family='binomial')
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(late_arrival.lr)
Call:
glm(formula = late_arrival ~ carrier + dep_delay + month + year,
family = "binomial", data = flights)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0972 -0.2445 -0.1920 -0.1570 3.9217
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.9122786 0.0430834 -90.807 < 2e-16 ***
carrierAA 0.2174443 0.0485813 4.476 7.61e-06 ***
carrierAS -0.3549507 0.2540636 -1.397 0.16239
carrierB6 0.5142442 0.0428985 11.987 < 2e-16 ***
carrierDL 0.2228855 0.0449833 4.955 7.24e-07 ***
carrierEV 0.3230899 0.0431394 7.489 6.92e-14 ***
carrierF9 1.1544420 0.1444764 7.991 1.34e-15 ***
carrierFL 0.7190162 0.0812251 8.852 < 2e-16 ***
carrierHA -0.2276957 0.4115495 -0.553 0.58008
carrierMQ 0.8086500 0.0475393 17.010 < 2e-16 ***
carrierOO 1.0138755 0.9037621 1.122 0.26193
carrierUA 0.0919203 0.0431571 2.130 0.03318 *
carrierUS 0.6063731 0.0525429 11.541 < 2e-16 ***
carrierVX -0.0485832 0.0852892 -0.570 0.56893
carrierWN -0.1551747 0.0574042 -2.703 0.00687 **
carrierYV 0.5737826 0.1999578 2.870 0.00411 **
dep_delay 0.1000536 0.0004308 232.263 < 2e-16 ***
month 0.0009126 0.0024337 0.375 0.70767
year NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284924 on 327345 degrees of freedom
Residual deviance: 108708 on 327328 degrees of freedom
AIC: 108744
Number of Fisher Scoring iterations: 7
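It keeps showing me this warning: (Dispersion parameter for binomial family taken to be 1)
How do I actually get from here to predictions? I know I somehow have to obtain the predicted and the actual values in order to get the true positive rate. Could someone guide me? Many thanks!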
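Answer 0 (score: 0)
Remove year from the model, because it has no variation, refit the model, and then pass flights as the newdata argument to the model's predict() method.
An example, using the terms and abbreviations from the Wikipedia page on ROC: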
library(nycflights13)
late_arrival<- flights$arr_delay>50
late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month, data=flights, family='binomial')
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit <- predict(late_arrival.lr, newdata = flights, type = "response")
d <- data.frame(late_arrival, fit)
# "Confusion matrix" of actual vs predicted outcomes
# for a cutpoint of 0.5:
xtabs(~ late_arrival + I(fit > 0.5), data = d)
#> I(fit > 0.5)
#> late_arrival FALSE TRUE
#> FALSE 290637 3091
#> TRUE 7386 26232
# Now do this for a range of cutpoints.
# Sensitivity = true positive rate = TPR
# Specificity = true negative rate = TNR
# 1 - Specificity = false positive rate = FPR = 1 - TNR
# The ROC plot is
# x = 1 - Specificity = FPR
# y = Sensitivity = TPR
fun <- function(cutpoint) {
pred <- d$fit > cutpoint
# cm = "confusion matrix"
cm <- xtabs(~ late_arrival + I(fit > cutpoint), data = d)
cm <- as.list(cm)
names(cm) <- c("TN", "FN", "FP", "TP")
sens <- with(cm, TP / (TP + FN))
spec <- with(cm, TN / (TN + FP))
return(data.frame(cutpoint, sens, spec))
}
# Example output:
fun(0.5)
#> cutpoint sens spec
#> 1 0.5 0.7802963 0.9894767
cutpoints <- seq(0.02, 0.98, by = 0.02)
# This is equivalent to
# rbind(fun(cutpoints[1]), fun(cutpoints[2]), ...)
roc <- do.call(rbind, lapply(cutpoints, fun))
plot(1 - roc$spec, roc$sens, type = "b",
xlab = "False positive rate (1 - specificity)",
ylab = "True positive rate (sensitivity)",
xlim = c(0, 1),
ylim = c(0, 1))
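Created on 2019-04-07 by the reprex package (v0.2.1.9000)
Note that a few issues need to be dealt with before answering your main question:
The effect estimate for year in your example is NA because this variable has no variation, so its effect cannot be estimated.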
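The question asks specifically for the TPR at an FPR of 0.5, which the cutpoint grid above only brackets. A minimal sketch of one way to read it off the empirical ROC curve follows. The helper name tpr_at_fpr and the toy vectors are assumptions for illustration, not part of the original answer. It returns the highest TPR reachable without exceeding the target FPR.

```r
# Hypothetical helper (not from the original answer): highest TPR on the
# empirical ROC curve whose FPR does not exceed target_fpr.
tpr_at_fpr <- function(actual, fit, target_fpr) {
  keep <- !is.na(actual) & !is.na(fit)      # drop rows with missing values
  actual <- actual[keep]
  fit <- fit[keep]
  # Sort by decreasing predicted probability; lowering the cutpoint
  # adds cases to the "predicted positive" set in this order.
  ord <- order(fit, decreasing = TRUE)
  actual <- actual[ord]
  fit <- fit[ord]
  # Keep one ROC point per distinct fitted value (handles ties in fit).
  last_of_run <- !duplicated(fit, fromLast = TRUE)
  tpr <- c(0, (cumsum(actual) / sum(actual))[last_of_run])
  fpr <- c(0, (cumsum(!actual) / sum(!actual))[last_of_run])
  # Step-function lookup: take the point at or just left of target_fpr;
  # ties = max keeps the top of each vertical segment.
  approx(fpr, tpr, xout = target_fpr,
         method = "constant", f = 0, ties = max)$y
}

# Toy data (assumed), already ordered for readability.
actual <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
fit    <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
tpr_at_fpr(actual, fit, 0.5)
```

With the objects from the example above, tpr_at_fpr(d$late_arrival, d$fit, 0.5) should give the value the question asks about. Because every distinct fitted value is used as a cutpoint, no grid of cutpoints has to be chosen.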
> unique(flights$year)
[1] 2013
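The same check can be run over all candidate predictors at once. The sketch below uses a small made-up data frame (toy, an assumption standing in for the relevant flights columns) and lists every column that has fewer than two distinct values.

```r
# Toy stand-in for the flights columns used in the model (assumed data).
toy <- data.frame(
  carrier   = c("AA", "UA", "AA", "DL"),
  dep_delay = c(5, 60, -3, 12),
  month     = c(1, 1, 2, 3),
  year      = c(2013, 2013, 2013, 2013)
)

# Count distinct non-missing values per column.
n_distinct <- sapply(toy, function(v) length(unique(v[!is.na(v)])))

# Columns with fewer than two distinct values cannot have an estimable effect.
constant_cols <- names(n_distinct)[n_distinct < 2]
constant_cols
```

Replacing toy with flights[c("carrier", "dep_delay", "month", "year")] should flag year in the same way.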
If you remove this predictor and refit, the output makes sense (meaning there are no NAs or huge standard errors):
> late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month, data=flights, family='binomial')
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
> coef(summary(late_arrival.lr))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.325540101 0.0564526220 -94.3364527 0.000000e+00
carrierAA 0.335139676 0.0622536491 5.3834543 7.306979e-08
carrierAS -0.980666348 0.3701250164 -2.6495544 8.059801e-03
carrierB6 0.524971196 0.0542918253 9.6694335 4.066226e-22
carrierDL 0.406813418 0.0576767561 7.0533339 1.746810e-12
carrierEV 0.350366432 0.0535144496 6.5471370 5.865056e-11
carrierF9 0.776012126 0.2084826127 3.7221911 1.975015e-04
carrierFL 0.773647203 0.1077982499 7.1768067 7.135846e-13
carrierHA -2.225896541 0.8684691013 -2.5630118 1.037685e-02
carrierMQ 0.847415433 0.0601677914 14.0842037 4.749822e-45
carrierOO 0.232324503 1.3043323784 0.1781176 8.586307e-01
carrierUA 0.157191477 0.0549977051 2.8581461 4.261241e-03
carrierUS 0.649304471 0.0697493204 9.3091154 1.289014e-20
carrierVX 0.237994726 0.1131585684 2.1031967 3.544858e-02
carrierWN 0.032542799 0.0736491439 0.4418626 6.585887e-01
carrierYV 0.861814625 0.2373042135 3.6316870 2.815745e-04
dep_delay 0.089655081 0.0004428296 202.4595603 0.000000e+00
month 0.005089147 0.0032449949 1.5683066 1.168096e-01
The warning fitted probabilities numerically 0 or 1 occurred usually means that one of your continuous predictors perfectly predicts the outcome. For example:
> x <- c(1, 2, 3)
> y <- c(0, 0, 1)
> coef(summary(glm(y ~ x, family="binomial")))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Estimate Std. Error z value Pr(>|z|)
(Intercept) -115.57626 226884.08 -0.0005094067 0.9995936
x 46.34447 94156.73 0.0004922056 0.9996073
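Here, the best estimate is
P(y = 1) = 0 if x < threshold, else 1
but this causes two numerical problems:
In the flights case, however, I think the warning means just what it says: some predictions are so certain that rounding error wipes out any nuance.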
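To illustrate that this warning can appear without perfect separation, here is a small simulated sketch. The data, the seed and the tolerance eps are all assumptions. In particular, 10 * .Machine$double.eps is my understanding of the tolerance glm.fit uses for this check, so treat it as approximate. A few extreme predictor values are enough to push some fitted probabilities to numerically 1 while the coefficients stay moderate.

```r
set.seed(1)

# Simulated data (assumption): mostly moderate x values plus a few extreme
# ones, loosely mimicking a handful of very long departure delays.
x <- c(rnorm(1000, mean = 0, sd = 20), 400, 600, 900)
y <- rbinom(length(x), size = 1, prob = plogis(-2 + 0.1 * x))

# The outcome groups overlap in x, so there is no perfect separation,
# yet this fit is expected to emit the "numerically 0 or 1" warning.
m <- glm(y ~ x, family = "binomial")
p <- fitted(m)

# Count fitted probabilities indistinguishable from 0 or 1 at this tolerance.
eps <- 10 * .Machine$double.eps
n_extreme <- sum(p < eps | p > 1 - eps)

coef(m)      # moderate estimates, unlike the separated toy example
n_extreme
```

For the flights model, sum(fit < eps | fit > 1 - eps, na.rm = TRUE) on the fit vector from the example above would count the corresponding cases.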
To check whether late_arrival really can be perfectly predicted by a single x variable, I used the following code:
# Make warnings print as they appear.
# options() returns the previous settings, and we store it
warn <- options(warn = 1)$warn
for (i in c("carrier", "dep_delay", "month", "year")) {
print(i)
glm(late_arrival~flights[[i]], family='binomial')
}
# Restore the previous warning setting
options(warn = warn)
which prints
[1] "carrier"
[1] "dep_delay"
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
[1] "month"
[1] "year"
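However, plot(flights$dep_delay, late_arrival) (which takes a few seconds) shows that there is in fact no complete separation, in which every late_arrival would occur at dep_delay > some threshold.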
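The same conclusion can be reached without plotting by comparing the ranges of the predictor in the two outcome groups. This is a minimal sketch. The helper name is_completely_separated is hypothetical, and it assumes a numeric predictor and a logical outcome with both outcome values present.

```r
# Hypothetical helper: TRUE if a single numeric predictor x completely
# separates a logical outcome y, i.e. the ranges of x in the two
# outcome groups do not overlap.
is_completely_separated <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)   # ignore rows with missing values
  x <- x[keep]
  y <- y[keep]
  max(x[!y]) < min(x[y]) || max(x[y]) < min(x[!y])
}

# The toy example from above: separated.
sep_toy <- is_completely_separated(c(1, 2, 3), c(FALSE, FALSE, TRUE))

# Overlapping groups: not separated.
sep_overlap <- is_completely_separated(c(1, 3, 2, 4),
                                       c(FALSE, FALSE, TRUE, TRUE))
c(sep_toy, sep_overlap)
```

Calling is_completely_separated(flights$dep_delay, late_arrival) applies the same check to the flights data and runs much faster than the plot.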