I have a dataset in which I need to predict a binary outcome (death = yes | no).
>head(riskScore)
     visitid convulsions unable.to.sit death age.less.4.months subjective.fever difficulty.breathing
1 2200120612          no           yes    no                 1            fever                    1
2 2202801112         yes           yes    no                 1            fever                    1
3 2209440612          no           yes   yes                 1          nofever                    0
4 2200820511          no           yes   yes                 1          nofever                    1
5 2402430812         yes            no    no                 1          nofever                    1
6 2200750512         yes           yes    no                 0            fever                    1
  altered.consciousness unable.to.drink temp.less.35.5 pallor jaundice deep.breathing unconsciousness meningeal.signs
1                     1               1              1      1        0              1               1               0
2                     1               1              0      1        0              1               1               1
3                     1               1              0      0        0              1               1               1
4                     0               1              1      0        0              1               1               0
5                     1               1              1      1        0              1               0               0
6                     1               1              1      1        0              1               1               1
  riskscore riskscorecat
1        10      High>=4
2        10      High>=4
3         9      High>=4
4         9      High>=4
5         9      High>=4
6         9      High>=4
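I fit a logistic regression on the variables I am interested in using glm(), but when I call predict() I get values below 0 and above 1. I'm not sure what is going on.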
> logit <- glm(death ~ ., data = riskScore[, -c(1, 16, 17)], family = "binomial")
> summary(logit)
Call:
glm(formula = death ~ ., family = "binomial", data = riskScore[,
-c(1, 16, 17)])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7094 -0.2518 -0.1754 -0.1462 3.0144
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)             -4.53265    0.05340 -84.879  < 2e-16 ***
convulsionsyes           0.32229    0.07447   4.328 1.51e-05 ***
unable.to.sityes         0.58151    0.07721   7.532 5.00e-14 ***
age.less.4.months        1.23293    0.06836  18.037  < 2e-16 ***
subjective.fevernofever  0.87071    0.07830  11.120  < 2e-16 ***
difficulty.breathing     0.40963    0.06802   6.022 1.72e-09 ***
altered.consciousness    0.54633    0.09988   5.470 4.50e-08 ***
unable.to.drink          0.24152    0.06974   3.463 0.000534 ***
temp.less.35.5           0.89761    0.11692   7.677 1.62e-14 ***
pallor                   0.36524    0.06135   5.954 2.62e-09 ***
jaundice                 0.63050    0.13143   4.797 1.61e-06 ***
deep.breathing           0.43571    0.06688   6.515 7.29e-11 ***
unconsciousness          0.94614    0.11055   8.558  < 2e-16 ***
meningeal.signs          0.64848    0.13607   4.766 1.88e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12588 on 44980 degrees of freedom
Residual deviance: 11046 on 44967 degrees of freedom
(5268 observations deleted due to missingness)
AIC: 11074
Number of Fisher Scoring iterations: 7
Here are the values returned by predict():
> prediction <- predict(logit, riskScore)
>length(prediction[prediction<0|prediction>1])
[1] 50204
>summary(prediction)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.533 -4.533 -4.167 -3.815 -3.340 1.197 5268
>prediction[1:50]
1 2 3 4 5 6 7 8 9 10
1.12398530 1.19715305 0.97069196 1.08311749 0.78933056 0.86182897 1.00974470 0.63536788 0.88719831 0.84384535
11 12 13 14 15 16 17 18 19 20
-0.07189721 0.59472009 1.00740095 0.39651479 0.87304497 0.90688192 0.03133122 0.61936992 0.78546065 0.54866902
21 22 23 24 25 26 27 28 29 30
0.17429220 0.43421440 0.42529536 0.08584600 0.42227834 0.50222038 -0.14955193 -0.10894807 -0.36910262 -0.13585073
31 32 33 34 35 36 37 38 39 40
0.23778021 0.30714403 0.25506022 0.40297594 -0.33391255 -0.49450910 -0.32841837 -0.39808824 -0.35807332 0.11296268
41 42 43 44 45 46 47 48 49 50
-0.03671438 -0.24974281 -0.05376394 -0.67077795 -0.32081810 -0.07350129 -0.35047305 0.13791068 -0.44453688 -0.51857901
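For comparison, here is a minimal sketch on simulated data (not the riskScore data; the variable `x` and the object names are made up for illustration) of what predict() returns for a binomial glm. As far as I can tell from the documentation, predict.glm defaults to type = "link", which is the log-odds scale, and type = "response" gives probabilities. That would be consistent with the minimum above, -4.533, matching the intercept estimate of -4.53265.

```r
# Simulated data for illustration only; names here are hypothetical.
set.seed(1)
n <- 500
x <- rbinom(n, 1, 0.3)
y <- rbinom(n, 1, plogis(-2 + 1.5 * x))
d <- data.frame(death = factor(ifelse(y == 1, "yes", "no"),
                               levels = c("no", "yes")),
                x = x)

fit <- glm(death ~ x, data = d, family = "binomial")

# Default type = "link": linear predictor (log-odds), any real number.
log_odds <- predict(fit, d)

# type = "response": fitted probabilities, bounded by 0 and 1.
prob <- predict(fit, d, type = "response")

range(log_odds)
range(prob)

# The two scales are related by the inverse logit, plogis().
all.equal(unname(plogis(log_odds)), unname(prob))
```

If this reading is right, applying plogis() to the existing `prediction` vector, or re-running predict() with type = "response", should give values between 0 and 1.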
In addition, when I tried to troubleshoot this, I got the following error message from confusionMatrix():
> confusionMatrix(prediction, riskScore$death)
Error in confusionMatrix.default(prediction, riskScore$death) :
the data cannot have more levels than the reference
These are the values I get with table() instead of confusionMatrix(), which I really don't understand:
> table(prediction, riskScore$death)[1:20,]
prediction no yes
-4.53264842453133 13877 103
-4.29112343685774 1103 10
-4.21035542116192 1788 18
-4.16740465712558 5648 90
-4.12301748166234 939 13
-4.09694208681398 891 17
-3.98631776105867 125 4
-3.96883043348833 148 2
-3.9511349297142 307 10
-3.92587966945199 593 19
-3.90214801699655 85 1
-3.88416439049418 38 1
-3.88149249398875 128 1
-3.8554170991404 113 4
-3.84511165375617 765 19
-3.80072447829293 129 5
-3.77464908344457 132 3
-3.75777371425659 396 10
-3.74479277338508 27 0
-3.73169831940823 840 20
Can anyone help me understand what is happening with the predictions, and why I don't get a 2x2 table of yes/no by yes/no?
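To make the target concrete, this is a rough sketch on simulated data (again not riskScore; `x`, the 0.5 cutoff, and the object names are assumptions for illustration) of the kind of 2x2 table I am after. It assumes the numeric predictions first have to be turned into a yes/no factor, since table() on a numeric vector produces one row per distinct value.

```r
# Simulated data for illustration only; names here are hypothetical.
set.seed(2)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))
d <- data.frame(death = factor(ifelse(y == 1, "yes", "no"),
                               levels = c("no", "yes")),
                x = x)

fit <- glm(death ~ x, data = d, family = "binomial")

# Probabilities rather than log-odds.
prob <- predict(fit, d, type = "response")

# Convert probabilities to classes; 0.5 is an arbitrary cutoff,
# not a recommendation for a rare outcome.
pred_class <- factor(ifelse(prob > 0.5, "yes", "no"),
                     levels = c("no", "yes"))

# Both inputs are factors with the same two levels, so this is 2x2.
tab <- table(predicted = pred_class, actual = d$death)
tab
```

Presumably two factors with identical levels, like `pred_class` and `d$death` here, are also what confusionMatrix() expects, which might explain the "more levels than the reference" error when it is given a numeric vector.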