When I try to fit models to predict the outcome "death" I get 100% accuracy, which is obviously wrong. Can someone tell me what I'm missing?
library(caret)
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
control <- trainControl(method="repeatedcv", repeats=3, number=5)
#C5.0 decision tree
set.seed(100)
modelC50 <- train(death~., data=training_Score, method="C5.0",trControl=control)
summary(modelC50)
#Call:
#C5.0.default(x = structure(c(3, 4, 2, 30, 4, 12, 156, 0.0328767150640488, 36, 0.164383560419083, 22, ...  [call output truncated]
#C5.0 [Release 2.07 GPL Edition] Tue Aug 4 10:23:10 2015
#-------------------------------
#Class specified by attribute `outcome'
#Read 27875 cases (23 attributes) from undefined.data
#21 attributes winnowed
#Estimated importance of remaining attributes:
#-2147483648% no.subjective.fevernofever
#Rules:
#Rule 1: (26982, lift 1.0)
# no.subjective.fevernofever <= 0
# -> class no [1.000]
#Rule 2: (893, lift 31.2)
# no.subjective.fevernofever > 0
# -> class yes [0.999]
#Default class: no
#Evaluation on training data (27875 cases):
# Rules
# ----------------
# No Errors
# 2 0( 0.0%) <<
# (a) (b) <-classified as
# ---- ----
# 26982 (a): class no
# 893 (b): class yes
# Attribute usage:
# 100.00% no.subjective.fevernofever
#Time: 0.1 secs
predictC50 <- predict(modelC50, testing_Score)
confusionMatrix(predictC50, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
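Before trusting that matrix, it is worth asking the model which variables it actually used. A quick diagnostic sketch using caret's generic importance extractor (the cross-tab line assumes `no.subjective.fever` is the suspect, as the rule output above suggests):

```r
# Which predictors does the fitted model rely on? A single variable at
# ~100% importance is a classic sign of target leakage.
varImp(modelC50)

# Cross-tabulate the suspect predictor against the outcome; empty
# off-diagonal cells mean it is a (near) duplicate of death.
table(training_Score$no.subjective.fever, training_Score$death)
```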
For the random forest model:
set.seed(100)
modelRF <- train(death~., data=training_Score, method="rf", trControl=control)
predictRF <- predict(modelRF,testing_Score)
confusionMatrix(predictRF, testing_Score$death)
#Confusion Matrix and Statistics
#
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
predictRFprobs <- predict(modelRF, testing_Score, type = "prob")
For the logit model:
set.seed(100)
modelLOGIT <- train(death~., data=training_Score,method="glm",family="binomial", trControl=control)
summary(modelLOGIT)
#Call:
#NULL
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -2.657e+01 7.144e+04 0.000 1.000
#age.in.months 3.554e-15 7.681e+01 0.000 1.000
#temp -1.916e-13 1.885e+03 0.000 1.000
#genderfemale 3.644e-14 4.290e+03 0.000 1.000
#no.subjective.fevernofever 5.313e+01 1.237e+04 0.004 0.997
#palloryes -1.156e-13 4.747e+03 0.000 1.000
#jaundiceyes -2.330e-12 1.142e+04 0.000 1.000
#vomitingyes 1.197e-13 4.791e+03 0.000 1.000
#diarrheayes -3.043e-13 4.841e+03 0.000 1.000
#dark.urineyes -6.958e-13 1.037e+04 0.000 1.000
#intercostal.retractionyes 2.851e-13 1.003e+04 0.000 1.000
#subcostal.retractionyes 7.414e-13 1.012e+04 0.000 1.000
#wheezingyes -1.756e-12 1.091e+04 0.000 1.000
#rhonchiyes -1.659e-12 1.074e+04 0.000 1.000
#difficulty.breathingyes 4.496e-13 6.504e+03 0.000 1.000
#deep.breathingyes 1.086e-12 7.075e+03 0.000 1.000
#convulsionsyes -1.294e-12 6.424e+03 0.000 1.000
#lethargyyes -4.338e-13 6.188e+03 0.000 1.000
#unable.to.sityes -4.284e-13 8.118e+03 0.000 1.000
#unable.to.drinkyes 7.297e-13 6.507e+03 0.000 1.000
#altered.consciousnessyes 2.907e-12 1.071e+04 0.000 1.000
#unconsciousnessyes 2.868e-11 1.505e+04 0.000 1.000
#meningeal.signsyes -1.177e-11 1.570e+04 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 7.9025e+03 on 27874 degrees of freedom
#Residual deviance: 1.6172e-07 on 27852 degrees of freedom
#AIC: 46
#Number of Fisher Scoring iterations: 25
predictLOGIT <- predict(modelLOGIT, testing_Score)
confusionMatrix(predictLOGIT, testing_Score$death)
#Confusion Matrix and Statistics
# Reference
#Prediction no yes
# no 17988 0
# yes 0 595
# Accuracy : 1
# 95% CI : (0.9998, 1)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : < 2.2e-16
# Kappa : 1
# Mcnemar's Test P-Value : NA
# Sensitivity : 1.000
# Specificity : 1.000
# Pos Pred Value : 1.000
# Neg Pred Value : 1.000
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 0.968
# Balanced Accuracy : 1.000
# 'Positive' Class : no
The data before partitioning:
str(riskFinal)
#'data.frame': 46458 obs. of 23 variables:
# $ age.in.months : num 3 3 4 2 1.16 ...
# $ temp : num 35.5 39.4 36.8 35.2 35 34.3 37.2 35.2 34.6 35.3 ...
# $ gender : Factor w/ 2 levels "male","female": 1 2 2 2 1 1 1 2 1 1 ...
# $ no.subjective.fever : Factor w/ 2 levels "fever","nofever": 1 1 2 2 1 1 2 2 2 1 ...
# $ pallor : Factor w/ 2 levels "no","yes": 2 2 1 1 2 2 2 1 2 2 ...
# $ jaundice : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ vomiting : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
# $ diarrhea : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
# $ dark.urine : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
# $ intercostal.retraction: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 1 2 ...
# $ subcostal.retraction : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 1 1 ...
# $ wheezing : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# $ rhonchi : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
# $ difficulty.breathing : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 1 1 2 ...
# $ deep.breathing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 2 ...
# $ convulsions : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 2 1 2 2 ...
# $ lethargy : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unable.to.sit : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ unable.to.drink : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
# $ altered.consciousness : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
# $ unconsciousness : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
# $ meningeal.signs : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 1 2 2 1 ...
# $ death : Factor w/ 2 levels "no","yes": 1 1 2 2 1 1 2 2 2 1 ...
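Given this structure, one way to hunt for the leak is to scan every factor column for a (near) perfect association with the target. A sketch (it flags any factor whose every level maps to a single outcome, so a constant column would also be flagged):

```r
# Flag factor predictors that perfectly track `death`: in a 2x2 cross-tab,
# perfect association means every row has a zero cell.
leaky <- sapply(setdiff(names(riskFinal), "death"), function(v) {
  if (!is.factor(riskFinal[[v]])) return(FALSE)
  tab <- table(riskFinal[[v]], riskFinal$death)
  all(apply(tab, 1, function(r) min(r) == 0))
})
names(leaky)[leaky]  # columns that carry the answer and must be dropped
```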
EDIT: Based on the comments, I realized that the no.subjective.fever variable has exactly the same values as the target variable death, so I excluded it from the models. Then I got even stranger results:
Random forest:
set.seed(100)
nmodelRF<- train(death~.-no.subjective.fever, data=training_Score, method="rf", trControl=control)
summary(nmodelRF)
npredictRF<-predict(nmodelRF,testing_Score)
confusionMatrix(npredictRF, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17988 595
# yes 0 0
#
# Accuracy : 0.968
# 95% CI : (0.9653, 0.9705)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5109
#
# Kappa : 0
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 1.000
# Specificity : 0.000
# Pos Pred Value : 0.968
# Neg Pred Value : NaN
# Prevalence : 0.968
# Detection Rate : 0.968
# Detection Prevalence : 1.000
# Balanced Accuracy : 0.500
#
# 'Positive' Class : no
Logit
set.seed(100)
nmodelLOGIT<- train(death~.-no.subjective.fever, data=training_Score,method="glm",family="binomial", trControl=control)
summary(nmodelLOGIT)
# Call:
# NULL
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.5113 -0.2525 -0.2041 -0.1676 3.1698
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.432065 1.084942 2.242 0.024984 *
#age.in.months -0.001047 0.001293 -0.810 0.417874
#temp -0.168704 0.028815 -5.855 4.78e-09 ***
#genderfemale -0.053306 0.070468 -0.756 0.449375
#palloryes 0.282123 0.076518 3.687 0.000227 ***
#jaundiceyes 0.323755 0.144607 2.239 0.025165 *
#vomitingyes -0.533661 0.082948 -6.434 1.25e-10 ***
#diarrheayes -0.040272 0.080417 -0.501 0.616520
#dark.urineyes -0.583666 0.168787 -3.458 0.000544 ***
#intercostal.retractionyes -0.021717 0.129607 -0.168 0.866926
#subcostal.retractionyes 0.269588 0.128772 2.094 0.036301 *
#wheezingyes -0.587940 0.150475 -3.907 9.34e-05 ***
#rhonchiyes -0.008565 0.140095 -0.061 0.951249
#difficulty.breathingyes 0.397394 0.087789 4.527 5.99e-06 ***
#deep.breathingyes 0.399302 0.098761 4.043 5.28e-05 ***
#convulsionsyes 0.132609 0.094038 1.410 0.158491
#lethargyyes 0.338599 0.089934 3.765 0.000167 ***
#unable.to.sityes 0.452111 0.104556 4.324 1.53e-05 ***
#unable.to.drinkyes 0.516878 0.089685 5.763 8.25e-09 ***
#altered.consciousnessyes 0.433672 0.123288 3.518 0.000436 ***
#unconsciousnessyes 0.754012 0.136105 5.540 3.03e-08 ***
#meningeal.signsyes 0.188823 0.161088 1.172 0.241130
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 7902.5 on 27874 degrees of freedom
# Residual deviance: 7148.5 on 27853 degrees of freedom
# AIC: 7192.5
#
# Number of Fisher Scoring iterations: 6
npredictLOGIT<-predict(nmodelLOGIT,testing_Score)
confusionMatrix(npredictLOGIT, testing_Score$death)
# Confusion Matrix and Statistics
#
# Reference
# Prediction no yes
# no 17982 592
# yes 6 3
#
# Accuracy : 0.9678
# 95% CI : (0.9652, 0.9703)
# No Information Rate : 0.968
# P-Value [Acc > NIR] : 0.5605
#
# Kappa : 0.009
# Mcnemar's Test P-Value : <2e-16
#
# Sensitivity : 0.999666
# Specificity : 0.005042
# Pos Pred Value : 0.968127
# Neg Pred Value : 0.333333
# Prevalence : 0.967981
# Detection Rate : 0.967659
# Detection Prevalence : 0.999516
# Balanced Accuracy : 0.502354
#
# 'Positive' Class : no
Answer (score: 2)
The 100% accuracy results are almost certainly wrong. I assume they arise because the target variable (or another variable with essentially the same entries as the target, as pointed out in @ulfelder's comment) is included in both the training and the test set. Such columns need to be removed during model building and testing, because they encode the very classification you are trying to predict; the train/test data should contain only information that (hopefully) leads to a correct classification of the target variable.
You could try the following:
target <- riskFinal$death
set.seed(100)
intrain <- createDataPartition(riskFinal$death,p=0.6, list=FALSE)
training_Score <- riskFinal[intrain,]
testing_Score <- riskFinal[-intrain,]
train_target <- training_Score$death
test_target <- testing_Score$death
training_Score <- training_Score[, -which(colnames(training_Score) == "death")]
testing_Score <- testing_Score[, -which(colnames(testing_Score) == "death")]
modelRF <- train(training_Score, train_target, method="rf", trControl=control)
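Once the leaking column is gone, the remaining difficulty is the heavy class imbalance (roughly 97% "no"), which is why an all-"no" model already scores 0.968. A sketch of one option, assuming a caret version that supports the `sampling` argument of `trainControl` (6.0-47 or later): down-sample the majority class inside each resample and optimise ROC rather than accuracy.

```r
# Also drop the near-duplicate predictor before fitting.
training_Score2 <- training_Score[, colnames(training_Score) != "no.subjective.fever"]

# Down-sample the majority class within resampling; judge models by ROC.
ctrl_down <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                          sampling = "down", classProbs = TRUE,
                          summaryFunction = twoClassSummary)
set.seed(100)
modelRF_down <- train(training_Score2, train_target, method = "rf",
                      metric = "ROC", trControl = ctrl_down)
```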
Then you can proceed as before, keeping in mind that the target "death" is now stored in the variables train_target and test_target.
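Separately, the near-chance logit results in your edit are partly a thresholding issue: with a ~3% positive class, the default 0.5 cutoff almost never fires. A sketch of re-thresholding on predicted probabilities, using the nmodelLOGIT fit from the edited question (the 0.05 cutoff is illustrative, not tuned; in practice pick it from a ROC curve on a validation set):

```r
# Class probabilities from the caret glm fit, then a custom cutoff.
probs <- predict(nmodelLOGIT, testing_Score, type = "prob")
pred_cut <- factor(ifelse(probs$yes > 0.05, "yes", "no"),
                   levels = c("no", "yes"))
confusionMatrix(pred_cut, test_target)
```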
Hope this helps.