我面临以下问题:我正在训练随机森林进行二元预测。数据结构如此:
> str(data)
'data.frame': 120269 obs. of 11 variables:
$ SeriousDlqin2yrs : num 1 0 0 0 0 0 0 0 0 0 ...
$ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
$ age : num 45 40 38 30 49 74 39 57 30 51 ...
$ NumberOfTime30.59DaysPastDueNotWorse: num 2 0 1 0 1 0 0 0 0 0 ...
$ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
$ MonthlyIncome : num 9120 2600 3042 3300 63588 ...
$ NumberOfOpenCreditLinesAndLoans : num 13 4 2 5 7 3 8 9 5 7 ...
$ NumberOfTimes90DaysLate : num 0 0 1 0 0 0 0 0 0 0 ...
$ NumberRealEstateLoansOrLines : num 6 0 0 0 1 1 0 4 0 2 ...
$ NumberOfTime60.89DaysPastDueNotWorse: num 0 0 0 0 0 0 0 0 0 0 ...
$ NumberOfDependents : num 2 1 0 0 0 1 0 2 0 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:29731] 7 9 17 33 42 53 59 63 72 87 ...
.. ..- attr(*, "names")= chr [1:29731] "7" "9" "17" "33" ...
我拆分数据
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
然后我运行模型并尝试进行预测:
model.rf <- randomForest(as.factor(train[,1]) ~ ., data=train,ntree=1000,mtry=10,importance=TRUE)
pred.rf <- predict(model.rf, test, type = "prob")
rfpred <- c(1:22773)
rfpred[pred.rf[,1]<=0.5] <- "yes"
rfpred[pred.rf[,1]>0.5] <- "no"
rfpred <- factor(rfpred)
test[,1][test[,1]==1] <- "yes"
test[,1][test[,1]==0] <- "no"
test[,1] <- factor(test[,1])
confusionMatrix(as.factor(rfpred), as.factor(test$Y))
我得到的是以下输出:
> print(model.rf)
Call:
randomForest(formula = as.factor(train[, 1]) ~ ., data = train, ntree = 1000, mtry = 10, importance = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 10
OOB estimate of error rate: 0%
Confusion matrix:
0 1 class.error
0 43093 0 0
1 0 25225 0
> head(pred.rf)
0 1
45868.1 1 0
112445 1 0
39001 1 0
133443 1 0
137460 1 0
125835.1 1 0
> confusionMatrix(as.factor(rfpred), as.factor(test$Y))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 14570 0
yes 0 8203
Accuracy : 1
95% CI : (0.9998, 1)
No Information Rate : 0.6398
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.6398
Detection Rate : 0.6398
Detection Prevalence : 0.6398
Balanced Accuracy : 1.0000
'Positive' Class : no
显然模型不能那么准确!!我的代码出了什么问题?