Question

I am running into the following problem: I am training a random forest for binary prediction. The data is structured as follows:

> str(data)
'data.frame':   120269 obs. of  11 variables:
 $ SeriousDlqin2yrs                    : num  1 0 0 0 0 0 0 0 0 0 ...
 $ RevolvingUtilizationOfUnsecuredLines: num  0.766 0.957 0.658 0.234 0.907 ...
 $ age                                 : num  45 40 38 30 49 74 39 57 30 51 ...
 $ NumberOfTime30.59DaysPastDueNotWorse: num  2 0 1 0 1 0 0 0 0 0 ...
 $ DebtRatio                           : num  0.803 0.1219 0.0851 0.036 0.0249 ...
 $ MonthlyIncome                       : num  9120 2600 3042 3300 63588 ...
 $ NumberOfOpenCreditLinesAndLoans     : num  13 4 2 5 7 3 8 9 5 7 ...
 $ NumberOfTimes90DaysLate             : num  0 0 1 0 0 0 0 0 0 0 ...
 $ NumberRealEstateLoansOrLines        : num  6 0 0 0 1 1 0 4 0 2 ...
 $ NumberOfTime60.89DaysPastDueNotWorse: num  0 0 0 0 0 0 0 0 0 0 ...
 $ NumberOfDependents                  : num  2 1 0 0 0 1 0 2 0 2 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:29731] 7 9 17 33 42 53 59 63 72 87 ...
 .. ..- attr(*, "names")= chr [1:29731] "7" "9" "17" "33" ...

I split the data into training and test sets:

index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]

Then I fit the model and try to make predictions:

model.rf <- randomForest(as.factor(train[,1]) ~ ., data=train,ntree=1000,mtry=10,importance=TRUE)
pred.rf <- predict(model.rf, test, type = "prob")
rfpred <- c(1:22773)
rfpred[pred.rf[,1]<=0.5] <- "yes"
rfpred[pred.rf[,1]>0.5] <- "no"
rfpred <- factor(rfpred)
test[,1][test[,1]==1] <- "yes"
test[,1][test[,1]==0] <- "no"
test[,1] <- factor(test[,1])
confusionMatrix(as.factor(rfpred), as.factor(test$Y))
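
As an aside, the labelling step above depends on a hard-coded vector length and on the column order of the probability matrix. A minimal sketch of a more defensive version, using a made-up probability matrix (`pred` and its values are hypothetical, not taken from my data) and indexing the class column by name:

```r
# Hypothetical stand-in for the matrix returned by predict(..., type = "prob");
# the columns are named after the class levels "0" and "1".
pred <- matrix(c(0.9, 0.1,
                 0.3, 0.7,
                 0.5, 0.5),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("0", "1")))

# Label a row "yes" when the probability of class "1" is at least 0.5.
# This matches the rule pred[, 1] <= 0.5 -> "yes" used above, but it does not
# rely on column position or on knowing the number of test rows in advance.
labels <- factor(ifelse(pred[, "1"] >= 0.5, "yes", "no"),
                 levels = c("no", "yes"))

print(labels)
```

Fixing the factor levels explicitly keeps both classes present even if one of them never occurs in the predictions.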

This is the output I get:

> print(model.rf)

Call:
 randomForest(formula = as.factor(train[, 1]) ~ ., data = train,      ntree = 1000, mtry = 10, importance = TRUE) 
           Type of random forest: classification
                 Number of trees: 1000
No. of variables tried at each split: 10

    OOB estimate of  error rate: 0%
Confusion matrix:
  0     1 class.error
0 43093     0           0
1     0 25225           0

> head(pred.rf)
         0 1
45868.1  1 0
112445   1 0
39001    1 0
133443   1 0
137460   1 0
125835.1 1 0

> confusionMatrix(as.factor(rfpred), as.factor(test$Y))
Confusion Matrix and Statistics

          Reference
Prediction    no   yes
       no  14570     0
       yes     0  8203

                Accuracy : 1          
                 95% CI : (0.9998, 1)
    No Information Rate : 0.6398     
    P-Value [Acc > NIR] : < 2.2e-16  

                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

        Sensitivity : 1.0000     
        Specificity : 1.0000     
     Pos Pred Value : 1.0000     
     Neg Pred Value : 1.0000     
         Prevalence : 0.6398     
     Detection Rate : 0.6398     
   Detection Prevalence : 0.6398     
      Balanced Accuracy : 1.0000     

   'Positive' Class : no

Clearly the model cannot be that accurate! What is wrong with my code?
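
One thing that may be worth checking is how the `.` in the formula expands when the response is written as `train[,1]` rather than by column name. The sketch below uses only base R and a toy data frame (`d`, `y`, `x1` and `x2` are hypothetical names, not my real columns) to compare the predictor lists produced by the two ways of writing the formula. If the response column shows up among the predictors, the forest can simply read off the answer, which would be consistent with a 0% error rate.

```r
# Toy data frame standing in for the real data; column names are hypothetical.
d <- data.frame(y  = c(1, 0, 0, 1),
                x1 = c(0.2, 0.5, 0.1, 0.9),
                x2 = c(45, 40, 38, 30))

# Response written as a positional index: no column name appears on the
# left-hand side, so "." has nothing to exclude when it is expanded.
by_position <- attr(terms(as.factor(d[, 1]) ~ ., data = d), "term.labels")

# Response written by name: "." expands to the columns other than y.
by_name <- attr(terms(as.factor(y) ~ ., data = d), "term.labels")

print(by_position)
print(by_name)
```

If the first list contains the response column, referring to it by name in the formula, for example `as.factor(SeriousDlqin2yrs) ~ .`, would be the thing to try next.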

Random forest overfitting?

0 answers: