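OK, this question has been asked here many times. However, all the answers are descriptive, and I don't know how to act on them (I'm new to this field). So I'm giving all of my code here. I'm using the following dataset for practice: http://archive.ics.uci.edu/ml/datasets/Census+Income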
library(ggplot2)
library(caret)
library(e1071)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education", "educationnum",
"maritalstatus", "occupation", "relationship", "race", "sex",
"capitalgain", "capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# loading data
df <- read.table ('adult.data', header = FALSE, sep = ',',
strip.white = TRUE, col.names = colNames,
na.strings = '?', stringsAsFactors = TRUE)
Check the dimensions
dim(df)
Output:
[1] 32561 15
That is too large, so I shorten it (otherwise applying the algorithm takes too much time):
df <- df[1:1200,]
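Taking the first 1200 rows keeps whatever order the file happens to have. If that order were not random (an assumption; nothing in the post says either way), a random subset could be drawn instead. A minimal sketch on a made-up data frame, where `toy`, `rows`, `toySample` and the seed 42 are all hypothetical names and values:

```r
# Hypothetical stand-in for the real data: 100 rows, two columns.
set.seed(42)                       # arbitrary seed so the draw is repeatable
toy <- data.frame(id = 1:100, x = rnorm(100))

# Draw 20 distinct row numbers at random (sample() is without replacement by default).
rows <- sample(nrow(toy), size = 20)
toySample <- toy[rows, ]

dim(toySample)
```

On the real data the same idea would read `df[sample(nrow(df), 1200), ]`.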
Take a look at the data
> str(df)
'data.frame': 1200 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Remove useless/unimportant variables
df$fnlwgt <- NULL
Remove missing values
sum(df == '', na.rm = TRUE)  # count empty-string cells; without na.rm the NAs from '?' make the sum NA
df[df == ''] <- NA
sum(!complete.cases(df))
df <- df[complete.cases(df),]
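To see where the missing values sit before dropping rows, the per-column counts can be inspected. A minimal sketch on three made-up lines in the same comma-separated layout; `raw`, `toy` and `toyClean` are hypothetical names, and the `droplevels()` call is an optional extra step that is not in the code above:

```r
# Three hypothetical records; '?' marks a missing value, as in adult.data.
raw <- c("39, State-gov, Bachelors",
         "50, ?, Bachelors",
         "38, Private, ?")

toy <- read.table(text = raw, header = FALSE, sep = ",",
                  strip.white = TRUE, na.strings = "?",
                  stringsAsFactors = TRUE,
                  col.names = c("age", "workclass", "education"))

colSums(is.na(toy))          # number of NAs in each column
sum(!complete.cases(toy))    # number of rows with at least one NA

# Keep only complete rows, then drop factor levels that no longer occur.
toyClean <- droplevels(toy[complete.cases(toy), ])
nrow(toyClean)
```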
Split the dataset
trainingInst <- createDataPartition(df$incomelevel,p=0.8,list = FALSE)
trainingSet <- df[trainingInst,]
testingSet <- df[-trainingInst,]
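Note that `createDataPartition` draws rows at random, and in this post `set.seed(291)` is only called later, before the model is built, so the split itself can differ from run to run. A minimal sketch of a repeatable split, using the built-in `iris` data as a stand-in for `df` (an assumption made only to keep the example self-contained; `idxA` and `idxB` are hypothetical names):

```r
library(caret)

# Seeding right before the split makes the partition repeatable.
set.seed(291)   # same value as in the post; any fixed integer works
idxA <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

set.seed(291)
idxB <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

identical(idxA, idxB)   # same seed, same rows

# The split is stratified, so class proportions should be similar in both parts.
prop.table(table(iris$Species[idxA]))
prop.table(table(iris$Species[-idxA]))
```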
Bar plots of both datasets
barplot(table(trainingSet$incomelevel), xlab = 'Income Level', ylab = 'Frequency',
main = 'Income Level at training dataset')
barplot(table(testingSet$incomelevel), xlab = 'Income Level', ylab = 'Frequency',
main = 'Income Level at testing dataset')
Output: (two bar plots, not reproduced here)
Build the model
set.seed(291)
control <- trainControl(method='repeatedcv',
number=10,
repeats=3,
search='grid')
tunegrid <- expand.grid(.mtry = (1:10))
rf.fit <- train(incomelevel ~ .,
data = trainingSet,
method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid)  # note: 'control' is never passed (no trControl argument)
print(rf.fit)
Output:
> print(rf.fit)
Random Forest
883 samples
13 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 883, 883, 883, 883, 883, 883, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
1 0.7579655 0.0000000
2 0.7579655 0.0000000
3 0.7802265 0.1471624
4 0.8156137 0.3628476
5 0.8221829 0.4242543
6 0.8269784 0.4616396
7 0.8313258 0.4899318
8 0.8323117 0.5017060
9 0.8298249 0.4988749
10 0.8266483 0.4917950
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 8.
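The output above reports "Resampling: Bootstrapped (25 reps)", which is caret's default, because the `control` object is created but not handed to `train`. A minimal sketch of passing it through the `trControl` argument, run on the built-in `iris` data with a smaller grid and fewer folds so that it finishes quickly (the dataset, the grid `1:3`, and `number = 5, repeats = 2` are assumptions made for illustration; the `randomForest` package must be installed):

```r
library(caret)

set.seed(291)
control <- trainControl(method = "repeatedcv",
                        number = 5,     # folds (the post uses 10)
                        repeats = 2,    # repetitions (the post uses 3)
                        search = "grid")
tunegrid <- expand.grid(mtry = 1:3)     # iris has only 4 predictors

fit <- train(Species ~ .,
             data = iris,
             method = "rf",
             metric = "Accuracy",
             tuneGrid = tunegrid,
             trControl = control)       # this argument is what applies 'control'

fit$control$method   # resampling method actually used by train()
print(fit)
```

With `trControl` supplied, the "Resampling:" line of the printed summary should describe repeated cross-validation rather than the bootstrap.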
Check the confusion matrix
confusionMatrix.train(rf.fit)
Output:
> confusionMatrix.train(rf.fit)
Bootstrapped (25 reps) Confusion Matrix
(entries are percentual average cell counts across resamples)
          Reference
Prediction <=50K >50K
     <=50K  70.3 11.3
     >50K    5.4 12.9
Accuracy (average) : 0.8323
Prediction:
testingSetPred <- predict(rf.fit,testingSet)
confusionMatrix(testingSetPred,testingSet$incomelevel)
Output:
> confusionMatrix(testingSetPred,testingSet$incomelevel)
Confusion Matrix and Statistics
          Reference
Prediction <=50K >50K
     <=50K   153   23
     >50K     13   31
Accuracy : 0.8364
95% CI : (0.7807, 0.8827)
No Information Rate : 0.7545
P-Value [Acc > NIR] : 0.002207
Kappa : 0.5288
Mcnemar's Test P-Value : 0.133614
Sensitivity : 0.9217
Specificity : 0.5741
Pos Pred Value : 0.8693
Neg Pred Value : 0.7045
Prevalence : 0.7545
Detection Rate : 0.6955
Detection Prevalence : 0.8000
Balanced Accuracy : 0.7479
'Positive' Class : <=50K
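The headline statistics above follow directly from the four cell counts, and recomputing them shows how few test rows they rest on. A minimal base-R sketch; `cm` and the other variable names are hypothetical, and using `binom.test` for the interval is an assumption about how the 95% CI was obtained (the result agrees with the printed values):

```r
# Cell counts copied from the test-set confusion matrix above.
# matrix() fills column by column, so each column is one Reference class.
cm <- matrix(c(153, 13, 23, 31), nrow = 2,
             dimnames = list(Prediction = c("<=50K", ">50K"),
                             Reference  = c("<=50K", ">50K")))

n       <- sum(cm)          # number of test rows
correct <- sum(diag(cm))    # correctly classified rows

accuracy    <- correct / n
sensitivity <- cm[1, 1] / sum(cm[, 1])   # positive class is '<=50K'
specificity <- cm[2, 2] / sum(cm[, 2])
nir         <- max(colSums(cm)) / n      # no-information rate
ci          <- binom.test(correct, n)$conf.int   # exact binomial 95% interval

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir), 4)
round(ci, 4)
1 / n   # change in accuracy caused by a single test row
```

With 220 test rows, each row moves the accuracy by about 0.0045, which is why the interval is roughly ten percentage points wide.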
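The accuracy on training is 0.8323 and on testing it is 0.8364. The difference is not large; in fact the two are almost equal. Here, however, I fitted the model with all the features (incomelevel ~ .); in some other runs I did not use all the features. In those runs, my training accuracy was the same as this result (0.8323), but the testing accuracy was a little higher (e.g., 0.85). Why does this happen? Am I doing something wrong?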