Question

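OK, this question has been asked here many times. However, all the answers are descriptive, and I don't know how to act on them (I am new to this field). So I have given all of my code here. I am using the following dataset for practice: http://archive.ics.uci.edu/ml/datasets/Census+Income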

library(ggplot2)
library(caret)
library(e1071)

# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education", "educationnum", 
              "maritalstatus", "occupation", "relationship", "race", "sex", 
              "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
              "incomelevel")

# loading data
df <- read.table ('adult.data', header = FALSE, sep = ',', 
                          strip.white = TRUE, col.names = colNames, 
                          na.strings = '?', stringsAsFactors = TRUE)

Check the dimensions

dim(df)

Output:

[1] 32561    15

That is too large, so shorten it (otherwise applying the algorithm takes too much time):

df <- df[1:1200,]

Look at the data

> str(df)
'data.frame':   1200 obs. of  15 variables:
 $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
 $ workclass    : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
 $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
 $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...
 $ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
 $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
 $ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
 $ incomelevel  : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

Remove a useless/unimportant variable

df$fnlwgt <- NULL

Remove missing values

sum(df == '')
df[df == ''] <- NA
sum(!complete.cases(df))
df <- df[complete.cases(df),]
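
Since read.table was called with na.strings = '?', the missing entries are presumably already NA by this point, so comparing against '' may find nothing, and sum(df == '') returns NA as soon as any compared cell is NA. A minimal sketch of counting missing values directly is given below. The data frame `toy` and its values are made up for illustration and stand in for df.

```r
# Made-up stand-in for df (assumption): two columns with one NA each
toy <- data.frame(
  age       = c(39, 50, NA, 53),
  workclass = factor(c("Private", NA, "Private", "State-gov"))
)

# Comparing against '' propagates NA, so the sum is NA rather than a count
empty_count_is_na <- is.na(sum(toy == ""))

# Count missing values per column and the number of incomplete rows
na_per_column <- colSums(is.na(toy))
n_incomplete  <- sum(!complete.cases(toy))

# Keep only the complete rows, as in the original code
toy_clean <- toy[complete.cases(toy), ]

print(na_per_column)
print(n_incomplete)
print(nrow(toy_clean))
```

Applied to df, colSums(is.na(df)) would show which columns contribute the dropped rows.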

Split the dataset

trainingInst <- createDataPartition(df$incomelevel,p=0.8,list = FALSE)
trainingSet <- df[trainingInst,]
testingSet <- df[-trainingInst,]
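
Note that createDataPartition draws a random split, and in the code as posted set.seed(291) is only called later, before the model is built. Each run may therefore produce a different training/testing split, which makes accuracies from separate runs hard to compare. A minimal sketch of fixing the seed before the split follows. The label vector `y` is made up and stands in for df$incomelevel.

```r
library(caret)

# Made-up class labels (assumption), standing in for df$incomelevel
y <- factor(rep(c("<=50K", ">50K"), times = c(75, 25)))

# Setting the seed immediately before the split makes it reproducible
set.seed(291)
idx1 <- createDataPartition(y, p = 0.8, list = FALSE)

set.seed(291)
idx2 <- createDataPartition(y, p = 0.8, list = FALSE)

# The same seed gives the same row indices
same_split <- identical(idx1, idx2)
print(same_split)
```

With the split fixed, a model using all features and a model using fewer features can be evaluated on exactly the same test rows.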

Bar plots of both datasets

barplot(table(trainingSet$incomelevel), xlab = 'Income Level', ylab = 'Frequency',
        main = 'Income Level at training dataset')
barplot(table(testingSet$incomelevel), xlab = 'Income Level', ylab = 'Frequency',
        main = 'Income Level at testing dataset')

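Output: [two bar plots of income level frequency, one for the training set and one for the testing set]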

Build the model

set.seed(291)
control <- trainControl(method='repeatedcv', 
                        number=10, 
                        repeats=3, 
                        search='grid')

tunegrid <- expand.grid(.mtry = (1:10)) 

rf.fit <- train(incomelevel ~ .,
            data = trainingSet,
            method = 'rf',
            metric = 'Accuracy',
            tuneGrid = tunegrid)
print(rf.fit)

Output:

> print(rf.fit)
Random Forest 

883 samples
 13 predictor
  2 classes: '<=50K', '>50K' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 883, 883, 883, 883, 883, 883, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   1    0.7579655  0.0000000
   2    0.7579655  0.0000000
   3    0.7802265  0.1471624
   4    0.8156137  0.3628476
   5    0.8221829  0.4242543
   6    0.8269784  0.4616396
   7    0.8313258  0.4899318
   8    0.8323117  0.5017060
   9    0.8298249  0.4988749
  10    0.8266483  0.4917950

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 8.
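
The output above reports "Resampling: Bootstrapped (25 reps)", which is the default resampling of train(), although a repeated 10-fold cross-validation was defined in `control`. This suggests that `control` was never handed to train(). A minimal sketch of passing it through the trControl argument is shown below. It uses a two-class subset of iris as stand-in data (an assumption, not the census data) and smaller fold counts to keep it quick; method = 'rf' also requires the randomForest package.

```r
library(caret)

set.seed(291)

# Stand-in data (assumption): two-class subset of iris instead of trainingSet
toy <- droplevels(iris[iris$Species != "setosa", ])

# Repeated cross-validation, with fewer folds/repeats than in the question
control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

fit <- train(Species ~ .,
             data = toy,
             method = "rf",
             metric = "Accuracy",
             tuneGrid = expand.grid(mtry = 1:2),
             trControl = control)  # without this, train() uses its default bootstrap

# The resampling method actually used by the fitted object
print(fit$control$method)
```

With trControl set, the printed summary should report cross-validated rather than bootstrapped resampling.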

Check the confusion matrix

confusionMatrix.train(rf.fit)

Output:

> confusionMatrix.train(rf.fit)
Bootstrapped (25 reps) Confusion Matrix 

(entries are percentual average cell counts across resamples)

          Reference
Prediction <=50K >50K
     <=50K  70.3 11.3
     >50K    5.4 12.9

 Accuracy (average) : 0.8323

Prediction:

testingSetPred <- predict(rf.fit,testingSet)
confusionMatrix(testingSetPred,testingSet$incomelevel)

Output:

> confusionMatrix(testingSetPred,testingSet$incomelevel)
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K   153   23
     >50K     13   31

               Accuracy : 0.8364          
                 95% CI : (0.7807, 0.8827)
    No Information Rate : 0.7545          
    P-Value [Acc > NIR] : 0.002207        

                  Kappa : 0.5288          

 Mcnemar's Test P-Value : 0.133614        

            Sensitivity : 0.9217          
            Specificity : 0.5741          
         Pos Pred Value : 0.8693          
         Neg Pred Value : 0.7045          
             Prevalence : 0.7545          
         Detection Rate : 0.6955          
   Detection Prevalence : 0.8000          
      Balanced Accuracy : 0.7479          

       'Positive' Class : <=50K
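
The headline numbers in this output can be recomputed from the four cell counts, which helps in judging how precise a test accuracy based on 220 rows is. The sketch below uses base R only and assumes the reported interval is an exact binomial (Clopper-Pearson) interval, which binom.test computes. The variable names are illustrative.

```r
# Cell counts copied from the test-set confusion matrix above
# (rows = Prediction, columns = Reference; matrix() fills column by column)
cm <- matrix(c(153, 13, 23, 31), nrow = 2,
             dimnames = list(Prediction = c("<=50K", ">50K"),
                             Reference  = c("<=50K", ">50K")))

n_correct <- sum(diag(cm))   # correctly classified rows: 153 + 31
n_total   <- sum(cm)         # all test rows

accuracy <- n_correct / n_total

# No-information rate: share of the most frequent reference class
nir <- max(colSums(cm)) / n_total

# Exact binomial 95% interval for the accuracy (assumed to match the CI line above)
ci <- binom.test(n_correct, n_total)$conf.int

print(round(accuracy, 4))
print(round(nir, 4))
print(round(ci, 4))
```

Both the resampled training accuracy (0.8323) and the other test accuracy mentioned (0.85) fall inside the reported interval (0.7807, 0.8827), so differences of this size may simply reflect the uncertainty of an estimate based on 220 test rows.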

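The training accuracy is 0.8323, whereas the testing accuracy is 0.8364. The difference is not large here; the two are in fact almost equal. However, here I fitted the model with all features (incomelevel ~ .), and in some other cases I did not use all features. In those cases my training accuracy was the same (0.8323), but the testing accuracy was a bit higher (e.g., 0.85). Why does this happen? Am I doing something wrong?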

Testing accuracy is greater than training accuracy
