Question

I have a dataset in which I need to predict a binary outcome (death = yes|no).

   >head(riskScore)
       visitid convulsions unable.to.sit death age.less.4.months subjective.fever difficulty.breathing
1 2200120612          no           yes    no                 1            fever                    1
2 2202801112         yes           yes    no                 1            fever                    1
3 2209440612          no           yes   yes                 1          nofever                    0
4 2200820511          no           yes   yes                 1          nofever                    1
5 2402430812         yes            no    no                 1          nofever                    1
6 2200750512         yes           yes    no                 0            fever                    1
  altered.consciousness unable.to.drink temp.less.35.5 pallor jaundice deep.breathing unconsciousness meningeal.signs
1                     1               1              1      1        0              1               1               0
2                     1               1              0      1        0              1               1               1
3                     1               1              0      0        0              1               1               1
4                     0               1              1      0        0              1               1               0
5                     1               1              1      1        0              1               0               0
6                     1               1              1      1        0              1               1               1
  riskscore riskscorecat
1        10      High>=4
2        10      High>=4
3         9      High>=4
4         9      High>=4
5         9      High>=4
6         9      High>=4

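I ran a logistic regression with glm on the variables I am interested in, but when I use the predict() function I get values below 0 and above 1. I am not sure what is going on.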

> logit <- glm(death ~ ., data = riskScore[, -c(1, 16, 17)], family = "binomial")
> summary(logit)

Call:
glm(formula = death ~ ., family = "binomial", data = riskScore[, 
    -c(1, 16, 17)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7094  -0.2518  -0.1754  -0.1462   3.0144  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -4.53265    0.05340 -84.879  < 2e-16 ***
convulsionsyes           0.32229    0.07447   4.328 1.51e-05 ***
unable.to.sityes         0.58151    0.07721   7.532 5.00e-14 ***
age.less.4.months        1.23293    0.06836  18.037  < 2e-16 ***
subjective.fevernofever  0.87071    0.07830  11.120  < 2e-16 ***
difficulty.breathing     0.40963    0.06802   6.022 1.72e-09 ***
altered.consciousness    0.54633    0.09988   5.470 4.50e-08 ***
unable.to.drink          0.24152    0.06974   3.463 0.000534 ***
temp.less.35.5           0.89761    0.11692   7.677 1.62e-14 ***
pallor                   0.36524    0.06135   5.954 2.62e-09 ***
jaundice                 0.63050    0.13143   4.797 1.61e-06 ***
deep.breathing           0.43571    0.06688   6.515 7.29e-11 ***
unconsciousness          0.94614    0.11055   8.558  < 2e-16 ***
meningeal.signs          0.64848    0.13607   4.766 1.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12588  on 44980  degrees of freedom
Residual deviance: 11046  on 44967  degrees of freedom
  (5268 observations deleted due to missingness)
AIC: 11074

Number of Fisher Scoring iterations: 7

Here are the values from the prediction:

>prediction<-predict(logit, riskScore)
>length(prediction[prediction<0|prediction>1])
[1] 50204
>summary(prediction)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -4.533  -4.533  -4.167  -3.815  -3.340   1.197    5268 
>prediction[1:50]
          1           2           3           4           5           6           7           8           9          10 
 1.12398530  1.19715305  0.97069196  1.08311749  0.78933056  0.86182897  1.00974470  0.63536788  0.88719831  0.84384535 
         11          12          13          14          15          16          17          18          19          20 
-0.07189721  0.59472009  1.00740095  0.39651479  0.87304497  0.90688192  0.03133122  0.61936992  0.78546065  0.54866902 
         21          22          23          24          25          26          27          28          29          30 
 0.17429220  0.43421440  0.42529536  0.08584600  0.42227834  0.50222038 -0.14955193 -0.10894807 -0.36910262 -0.13585073 
         31          32          33          34          35          36          37          38          39          40 
 0.23778021  0.30714403  0.25506022  0.40297594 -0.33391255 -0.49450910 -0.32841837 -0.39808824 -0.35807332  0.11296268 
         41          42          43          44          45          46          47          48          49          50 
-0.03671438 -0.24974281 -0.05376394 -0.67077795 -0.32081810 -0.07350129 -0.35047305  0.13791068 -0.44453688 -0.51857901
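
A note on the numbers above: the minimum prediction (-4.533) equals the model intercept, which suggests these values are on the log-odds (link) scale and not probabilities. By default, `predict()` on a `glm` object uses `type = "link"`; `type = "response"` returns probabilities instead. The following is a minimal sketch on simulated data, not the original `riskScore` data; the data frame `toy` and its columns `x1` and `x2` are hypothetical stand-ins.

```r
# Simulated stand-in for riskScore; `toy`, `x1`, `x2` are hypothetical names.
set.seed(1)
n <- 500
toy <- data.frame(x1 = rbinom(n, 1, 0.4), x2 = rbinom(n, 1, 0.3))
p_true <- plogis(-2 + 1.2 * toy$x1 + 0.8 * toy$x2)
toy$death <- factor(ifelse(runif(n) < p_true, "yes", "no"),
                    levels = c("no", "yes"))

fit <- glm(death ~ ., data = toy, family = "binomial")

# Default is type = "link": log-odds, which can be any real number.
log_odds <- predict(fit, toy)

# type = "response" applies the inverse logit and returns probabilities.
prob <- predict(fit, toy, type = "response")

range(log_odds)  # not restricted to [0, 1]
range(prob)      # always within [0, 1]

# The two scales are related through the logistic function plogis().
all.equal(unname(prob), unname(plogis(log_odds)))
```

Under this reading, a log-odds value of -4.533 corresponds to a probability of roughly 0.011, since `plogis(-4.533)` is 1 / (1 + exp(4.533)).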

In addition, when I tried to troubleshoot the problem, I got this error message from the confusion matrix:

> confusionMatrix(prediction, riskScore$death)
Error in confusionMatrix.default(prediction, riskScore$death) : 
  the data cannot have more levels than the reference

These are the values I get when I use table() instead of confusionMatrix(), which I really do not understand:

> table(prediction, riskScore$death)[1:20,]

prediction             no yes
  -4.53264842453133 13877 103
  -4.29112343685774  1103  10
  -4.21035542116192  1788  18
  -4.16740465712558  5648  90
  -4.12301748166234   939  13
  -4.09694208681398   891  17
  -3.98631776105867   125   4
  -3.96883043348833   148   2
  -3.9511349297142    307  10
  -3.92587966945199   593  19
  -3.90214801699655    85   1
  -3.88416439049418    38   1
  -3.88149249398875   128   1
  -3.8554170991404    113   4
  -3.84511165375617   765  19
  -3.80072447829293   129   5
  -3.77464908344457   132   3
  -3.75777371425659   396  10
  -3.74479277338508    27   0
  -3.73169831940823   840  20
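
The table above has one row per distinct predicted value because `prediction` is a continuous numeric vector, not a yes/no label. One way to get a 2x2 table is to convert probabilities to classes with a cutoff first. This is a sketch on simulated data; `toy`, `x1`, `x2`, and the 0.5 cutoff are assumptions chosen for illustration, and a different cutoff may be more appropriate when deaths are rare.

```r
# Simulated stand-in for riskScore; `toy`, `x1`, `x2` are hypothetical names.
set.seed(1)
n <- 500
toy <- data.frame(x1 = rbinom(n, 1, 0.4), x2 = rbinom(n, 1, 0.3))
p_true <- plogis(-1 + 1.5 * toy$x1 + 1.5 * toy$x2)
toy$death <- factor(ifelse(runif(n) < p_true, "yes", "no"),
                    levels = c("no", "yes"))

fit  <- glm(death ~ ., data = toy, family = "binomial")
prob <- predict(fit, toy, type = "response")

# Assumed cutoff: classify as "yes" when the predicted probability exceeds 0.5.
cutoff <- 0.5
pred_class <- factor(ifelse(prob > cutoff, "yes", "no"),
                     levels = levels(toy$death))

# Both arguments are factors with the same two levels, so the table is 2x2
# even if one class is never predicted.
tab <- table(predicted = pred_class, actual = toy$death)
tab
```

With the real data, rows that have missing predictors would yield `NA` predictions, which `table()` drops by default. Giving the predicted classes the same factor levels as the reference should also address the "more levels than the reference" error, since `confusionMatrix()` compares two factors.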

Can anyone help me understand what is happening with the predictions, and why I do not get a 2x2 table of yes/no x yes/no?

For logistic regression, predict() returns values greater than 1
