Model gives 100% accuracy — random forest, logit, C5.0?

Asked: 2015-08-04 09:46:35

Tags: r regression random-forest prediction logarithm

When trying to fit a model to predict the outcome "death", I get 100% accuracy, which is obviously wrong. Can someone tell me what I'm missing?

library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]

control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)

#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 


#C5.0 [Release 2.07 GPL Edition]    Tue Aug  4 10:23:10 2015
#-------------------------------

#Class specified by attribute `outcome'

#Read 27875 cases (23 attributes) from undefined.data

#21 attributes winnowed
#Estimated importance of remaining attributes:

#-2147483648%  no.subjective.fevernofever

#Rules:

#Rule 1: (26982, lift 1.0)
#   no.subjective.fevernofever <= 0
#   ->  class no  [1.000]

#Rule 2: (893, lift 31.2)
#   no.subjective.fevernofever > 0
#   ->  class yes  [0.999]

#Default class: no


#Evaluation on training data (27875 cases):

#           Rules     
#     ----------------
#       No      Errors

#        2    0( 0.0%)   <<


#      (a)   (b)    <-classified as
#     ----  ----
#    26982          (a): class no
#            893    (b): class yes


#   Attribute usage:

#   100.00% no.subjective.fevernofever


#Time: 0.1 secs


predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)

#Confusion Matrix and Statistics

#          Reference
#Prediction    no   yes
#       no  17988     0
#       yes     0   595

#               Accuracy : 1          
#                 95% CI : (0.9998, 1)
#    No Information Rate : 0.968      
#    P-Value [Acc > NIR] : < 2.2e-16  

#                  Kappa : 1          
# Mcnemar's Test P-Value : NA         

#            Sensitivity : 1.000      
#            Specificity : 1.000      
#         Pos Pred Value : 1.000      
#         Neg Pred Value : 1.000      
#             Prevalence : 0.968      
#         Detection Rate : 0.968      
#   Detection Prevalence : 0.968      
#      Balanced Accuracy : 1.000      

#       'Positive' Class : no       

For the random forest model:

set.seed(100)
modelRF <- train(death~., data=training_Score, method="rf", trControl=control)
predictRF <- predict(modelRF,testing_Score)
confusionMatrix(predictRF, testing_Score$death)

#Confusion Matrix and Statistics
#
#          Reference
#Prediction    no   yes
#       no  17988     0
#       yes     0   595

#               Accuracy : 1          
#                 95% CI : (0.9998, 1)
#    No Information Rate : 0.968      
#    P-Value [Acc > NIR] : < 2.2e-16  

#                  Kappa : 1          
# Mcnemar's Test P-Value : NA         

#            Sensitivity : 1.000      
#            Specificity : 1.000      
#         Pos Pred Value : 1.000      
#         Neg Pred Value : 1.000      
#             Prevalence : 0.968      
#         Detection Rate : 0.968      
#   Detection Prevalence : 0.968      
#      Balanced Accuracy : 1.000      

#       'Positive' Class : no         

predictRFprobs <- predict(modelRF, testing_Score, type = "prob")

For the logit model:

set.seed(100)
modelLOGIT <- train(death~., data=training_Score,method="glm",family="binomial", trControl=control)
summary(modelLOGIT)

#Call:
#NULL

#Deviance Residuals: 
#       Min          1Q      Median          3Q         Max  
#-2.409e-06  -2.409e-06  -2.409e-06  -2.409e-06   2.409e-06  

#Coefficients:
#                             Estimate Std. Error z value Pr(>|z|)
#(Intercept)                -2.657e+01  7.144e+04   0.000    1.000
#age.in.months               3.554e-15  7.681e+01   0.000    1.000
#temp                       -1.916e-13  1.885e+03   0.000    1.000
#genderfemale                3.644e-14  4.290e+03   0.000    1.000
#no.subjective.fevernofever  5.313e+01  1.237e+04   0.004    0.997
#palloryes                  -1.156e-13  4.747e+03   0.000    1.000
#jaundiceyes                -2.330e-12  1.142e+04   0.000    1.000
#vomitingyes                 1.197e-13  4.791e+03   0.000    1.000
#diarrheayes                -3.043e-13  4.841e+03   0.000    1.000
#dark.urineyes              -6.958e-13  1.037e+04   0.000    1.000
#intercostal.retractionyes   2.851e-13  1.003e+04   0.000    1.000
#subcostal.retractionyes     7.414e-13  1.012e+04   0.000    1.000
#wheezingyes                -1.756e-12  1.091e+04   0.000    1.000
#rhonchiyes                 -1.659e-12  1.074e+04   0.000    1.000
#difficulty.breathingyes     4.496e-13  6.504e+03   0.000    1.000
#deep.breathingyes           1.086e-12  7.075e+03   0.000    1.000
#convulsionsyes             -1.294e-12  6.424e+03   0.000    1.000
#lethargyyes                -4.338e-13  6.188e+03   0.000    1.000
#unable.to.sityes           -4.284e-13  8.118e+03   0.000    1.000
#unable.to.drinkyes          7.297e-13  6.507e+03   0.000    1.000
#altered.consciousnessyes    2.907e-12  1.071e+04   0.000    1.000
#unconsciousnessyes          2.868e-11  1.505e+04   0.000    1.000
#meningeal.signsyes         -1.177e-11  1.570e+04   0.000    1.000

#(Dispersion parameter for binomial family taken to be 1)

#    Null deviance: 7.9025e+03  on 27874  degrees of freedom
#Residual deviance: 1.6172e-07  on 27852  degrees of freedom
#AIC: 46

#Number of Fisher Scoring iterations: 25

predictLOGIT <- predict(modelLOGIT, testing_Score)

confusionMatrix(predictLOGIT, testing_Score$death)

#Confusion Matrix and Statistics

#          Reference
#Prediction    no   yes
#       no  17988     0
#       yes     0   595

#               Accuracy : 1          
#                 95% CI : (0.9998, 1)
#    No Information Rate : 0.968      
#    P-Value [Acc > NIR] : < 2.2e-16  

#                  Kappa : 1          
# Mcnemar's Test P-Value : NA         

#            Sensitivity : 1.000      
#            Specificity : 1.000      
#         Pos Pred Value : 1.000      
#         Neg Pred Value : 1.000      
#             Prevalence : 0.968      
#         Detection Rate : 0.968      
#   Detection Prevalence : 0.968      
#      Balanced Accuracy : 1.000      

#       'Positive' Class : no    

The data before partitioning:

str(riskFinal)
#'data.frame':  46458 obs. of  23 variables:
# $ age.in.months         : num  3 3 4 2 1.16 ...
# $ temp                  : num  35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender                : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever   : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor                : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting              : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea              : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine            : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi               : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing        : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions           : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy              : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit         : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness       : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs       : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death                 : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...
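
A quick sanity check for this kind of result is to cross-tabulate each factor predictor against the outcome: a two-way table in which every row has at most one nonzero cell means that predictor perfectly encodes the target (data leakage). A minimal sketch, assuming riskFinal is loaded as above:

```r
# Scan the factor predictors for perfect separation of the outcome.
# A table whose rows each contain a single nonzero cell means the
# predictor is just a relabeled copy of the target variable.
for (col in setdiff(names(riskFinal), "death")) {
  if (is.factor(riskFinal[[col]])) {
    tab <- table(riskFinal[[col]], riskFinal$death)
    if (all(rowSums(tab > 0) <= 1)) {
      cat(col, "perfectly separates death:\n")
      print(tab)
    }
  }
}
```

Run on this data, it should flag no.subjective.fever, which is exactly the variable the C5.0 rules singled out with 100% attribute usage.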

Edit: following the comments, I realized that the no.subjective.fever variable has exactly the same values as the target variable death, so I excluded it from the model. I then got even stranger results:

Random forest

set.seed(100)
nmodelRF <- train(death~. -no.subjective.fever, data=training_Score, method="rf", trControl=control)
summary(nmodelRF)
npredictRF <- predict(nmodelRF, testing_Score)
confusionMatrix(npredictRF, testing_Score$death)


#Confusion Matrix and Statistics

#          Reference
#Prediction    no   yes
#       no  17988   595
#       yes     0     0

#               Accuracy : 0.968
#                 95% CI : (0.9653, 0.9705)
#    No Information Rate : 0.968
#    P-Value [Acc > NIR] : 0.5109

#                  Kappa : 0
# Mcnemar's Test P-Value : <2e-16

#            Sensitivity : 1.000
#            Specificity : 0.000
#         Pos Pred Value : 0.968
#         Neg Pred Value : NaN
#             Prevalence : 0.968
#         Detection Rate : 0.968
#   Detection Prevalence : 1.000
#      Balanced Accuracy : 0.500

#       'Positive' Class : no


Logit

set.seed(100)
nmodelLOGIT <- train(death~. -no.subjective.fever, data=training_Score, method="glm", family="binomial", trControl=control)
summary(nmodelLOGIT)

# Call:
#         NULL
# 
# Deviance Residuals: 
#         Min       1Q   Median       3Q      Max  
# -1.5113  -0.2525  -0.2041  -0.1676   3.1698  
# 
# Coefficients:
#                           Estimate Std. Error z value Pr(>|z|)    
#(Intercept)                2.432065   1.084942   2.242 0.024984 *  
#age.in.months             -0.001047   0.001293  -0.810 0.417874    
#temp                      -0.168704   0.028815  -5.855 4.78e-09 ***
#genderfemale              -0.053306   0.070468  -0.756 0.449375    
#palloryes                  0.282123   0.076518   3.687 0.000227 ***
#jaundiceyes                0.323755   0.144607   2.239 0.025165 *  
#vomitingyes               -0.533661   0.082948  -6.434 1.25e-10 ***
#diarrheayes               -0.040272   0.080417  -0.501 0.616520    
#dark.urineyes             -0.583666   0.168787  -3.458 0.000544 ***
#intercostal.retractionyes -0.021717   0.129607  -0.168 0.866926    
#subcostal.retractionyes    0.269588   0.128772   2.094 0.036301 *  
#wheezingyes               -0.587940   0.150475  -3.907 9.34e-05 ***
#rhonchiyes                -0.008565   0.140095  -0.061 0.951249    
#difficulty.breathingyes    0.397394   0.087789   4.527 5.99e-06 ***
#deep.breathingyes          0.399302   0.098761   4.043 5.28e-05 ***
#convulsionsyes             0.132609   0.094038   1.410 0.158491    
#lethargyyes                0.338599   0.089934   3.765 0.000167 ***
#unable.to.sityes           0.452111   0.104556   4.324 1.53e-05 ***
#unable.to.drinkyes         0.516878   0.089685   5.763 8.25e-09 ***
#altered.consciousnessyes   0.433672   0.123288   3.518 0.000436 ***
#unconsciousnessyes         0.754012   0.136105   5.540 3.03e-08 ***
#meningeal.signsyes         0.188823   0.161088   1.172 0.241130    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
# Null deviance: 7902.5  on 27874  degrees of freedom
# Residual deviance: 7148.5  on 27853  degrees of freedom
# AIC: 7192.5
# 
# Number of Fisher Scoring iterations: 6

npredictLOGIT <- predict(nmodelLOGIT, testing_Score)
confusionMatrix(npredictLOGIT, testing_Score$death)

#Confusion Matrix and Statistics

#          Reference
#Prediction    no   yes
#       no  17982   592
#       yes     6     3

#               Accuracy : 0.9678
#                 95% CI : (0.9652, 0.9703)
#    No Information Rate : 0.968
#    P-Value [Acc > NIR] : 0.5605

#                  Kappa : 0.009
# Mcnemar's Test P-Value : <2e-16

#            Sensitivity : 0.999666
#            Specificity : 0.005042
#         Pos Pred Value : 0.968127
#         Neg Pred Value : 0.333333
#             Prevalence : 0.967981
#         Detection Rate : 0.967659
#   Detection Prevalence : 0.999516
#      Balanced Accuracy : 0.502354

#       'Positive' Class : no
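
These "stranger" results are what a heavily imbalanced outcome (about 96.8% "no") typically produces once the leaky predictor is gone: both models fall back to predicting the majority class, so accuracy matches the no-information rate while specificity collapses. One common mitigation in caret is to resample inside cross-validation; a sketch using down-sampling, assuming a caret version recent enough to support the sampling argument of trainControl:

```r
# Down-sample the majority class within each CV fold so the model
# sees balanced classes during training; evaluation still happens
# on the original (imbalanced) held-out data.
control_ds <- trainControl(method="repeatedcv", repeats=3, number=5, sampling="down")

set.seed(100)
nmodelRF_ds <- train(death~. -no.subjective.fever, data=training_Score,
                     method="rf", trControl=control_ds)
```

With balanced training folds, sensitivity for the rare "yes" class usually improves at the cost of some overall accuracy; judge the result by balanced accuracy or kappa rather than raw accuracy.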

1 Answer:

Answer 0 (score: 2):

The 100% accuracy results are probably not correct. I assume they are due to the target variable (or another variable with essentially the same entries as the target, as pointed out in @ulfelder's comment) being included in both the training and test sets. Such columns usually need to be removed during model building and testing, because they represent the target that defines the classification, whereas the train/test data should contain only information that (hopefully) leads to a correct classification based on the target variable.

You could try the following approach:

target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
train_target <- training_Score$death
test_target <- testing_Score$death
training_Score <- training_Score[, -which(colnames(training_Score)=="death")]
testing_Score <- testing_Score[, -which(colnames(testing_Score)=="death")]
modelRF <- train(training_Score, train_target, method="rf", trControl=control)

Then you can proceed as before, noting that the target "death" is now stored in the variables train_target and test_target.
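
Prediction and evaluation then use those same objects; a sketch continuing the code above, assuming the testing_Score frame has had its death column removed as shown:

```r
# Predict on the predictor-only test frame and score against the
# separately stored target vector.
predictRF <- predict(modelRF, testing_Score)
confusionMatrix(predictRF, test_target)
```

Because train() was called with the x/y interface rather than a formula, predict() only ever sees predictor columns, which makes this kind of target leakage impossible by construction.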

Hope this helps.