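I am using caret's functions to fit and compare several models. I noticed that svmRadial and kNN take quite a while to run on the dataset I chose.
Dataset: the original is the Census Income (Adult) dataset, but I reduced it, and the resulting data looks like this: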
'data.frame': 32561 obs. of 13 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ response : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
$ cntrymap : Factor w/ 9 levels "British-Commonwealth",..: 9 9 9 9 6 9 5 9 9 9 ...
$ relationship_new: Factor w/ 5 levels "Not-in-family",..: 1 4 1 4 4 4 1 4 1 4 ...
$ workclass_new : Factor w/ 8 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ capitalgainloss : int 2174 0 0 0 0 0 0 0 14084 5178 ...
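There are still a lot of factors in the dataset, and I am not sure whether they should be preprocessed or re-engineered to help the performance of the train() call.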
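To make the concern about factors concrete: with the formula interface (response~.), each factor with k levels is typically expanded into k - 1 indicator columns before the model sees it. The sketch below uses a small made-up data frame (the name toy and its values are assumptions, not the real data) and base R's model.matrix() to show the effect.

```r
# Toy stand-in for the real data: one numeric column and two factors.
# The values here are invented purely for illustration.
toy <- data.frame(
  age      = c(39, 50, 38, 53, 28, 37),
  sex      = factor(c("Male", "Male", "Male", "Male", "Female", "Female")),
  race     = factor(c("White", "White", "White", "Black", "Black", "Other")),
  response = factor(c("<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"))
)

# model.matrix() performs the usual treatment-contrast expansion:
# a factor with k levels becomes k - 1 indicator (0/1) columns.
x <- model.matrix(response ~ ., data = toy)[, -1, drop = FALSE]  # drop the intercept

n_before <- ncol(toy) - 1  # predictors before expansion
n_after  <- ncol(x)        # numeric columns after expansion

print(n_before)
print(n_after)
print(colnames(x))
```

Applying the same k - 1 rule to the str() output above (7 factor predictors with 51 levels in total, plus 5 integer columns) would give roughly 49 numeric columns for the 12 predictors, which is what the distance and kernel computations in kNN and svmRadial would then operate on.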
Running kNN on the 70/30 split still takes quite a long time, though not longer than svmRadial:
adult.kNN <- train(response~., data=adultFile, method="knn", metric=metric, preProc=preProc, trControl=control, tuneLength=10)
adult.svmRadial <- train(response~., data=adultTraining, method="svmRadial", metric=metric, preProc=c("center", "scale"),
trControl=control, tuneLength = 5)
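The two calls above refer to control, metric and preProc, which are not defined anywhere in the question. The following is only a sketch of what such definitions might look like; every value in it is an assumption chosen for illustration, not the setup actually used.

```r
library(caret)

# ASSUMPTIONS: none of these values appear in the question.
control <- trainControl(method = "cv", number = 5)  # hypothetical resampling scheme
metric  <- "Accuracy"                               # hypothetical tuning metric
preProc <- c("center", "scale")                     # hypothetical; mirrors the svmRadial call

# Rough count of model fits for the kNN call under these assumptions:
# one fit per fold per candidate value of k, plus one final fit on all rows.
knn_tune_length <- 10
approx_knn_fits <- control$number * knn_tune_length + 1
print(approx_knn_fits)
```

Under these assumed settings, run time grows with both the number of folds and tuneLength, so either one is a lever worth checking when timing the calls.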
Below is my code that sets up the data so the problem can be reproduced:
data_url <- c("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")
download.file(url = data_url, destfile = "adult.data")
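# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0, matching the str() output above
fullData <- read.csv("adult.data", sep = ',', header = FALSE, strip.white = TRUE, stringsAsFactors = TRUE)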
#fullData <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header=F,strip.white=TRUE)
names(fullData) <- c("age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation", "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry", "response")
library(sqldf)  # needed for the sqldf() call below
prepInputFile2 <- sqldf("select *,
case
when nativecountry == 'United-States' then 'United-States'
when nativecountry == 'China' OR nativecountry == 'Hong' OR nativecountry == 'Taiwan' then 'China'
when nativecountry == 'Cambodia' OR nativecountry == 'Laos' OR nativecountry == 'Philippines'
OR nativecountry == 'Thailand' OR nativecountry == 'Vietnam' then 'SoutEast-Asia'
when nativecountry == 'Canada' OR nativecountry == 'England' OR nativecountry == 'India' OR nativecountry == 'Ireland'
OR nativecountry == 'Scotland' then 'British-Commonwealth'
when nativecountry == 'Columbia' OR nativecountry == 'El-Salvador' OR nativecountry == 'Ecuador' OR nativecountry == 'Peru'
then 'South-America'
when nativecountry == 'Dominican-Republic' OR nativecountry == 'Guatemala' OR nativecountry == 'Haiti'
OR nativecountry == 'Honduras' OR nativecountry == 'Jamaica' OR nativecountry == 'Mexico' OR nativecountry =='Nicaragua'
OR nativecountry == 'Outlying-US(Guam-USVI-etc)' OR nativecountry == 'Puerto-Rico' OR nativecountry =='Trinadad&Tobago'
then 'Latin-America'
when nativecountry == 'France' OR nativecountry == 'Germany' OR nativecountry == 'Holand-Netherlands'
OR nativecountry == 'Italy' then 'Euro-1'
when nativecountry == 'Yugoslavia' OR nativecountry == 'Greece' OR nativecountry == 'Hungary' OR nativecountry == 'Poland'
OR nativecountry == 'Portugal' OR nativecountry == 'South' then 'Euro-2'
when nativecountry == 'Cuba' OR nativecountry == 'Iran' OR nativecountry == 'Japan' OR nativecountry == '?' then 'Other'
else 'Undetermined'
end as cntrymap,
case
when relationship == 'Husband' OR relationship == 'Wife' then 'Spouse' else relationship
end as relationship_new,
case
when workclass == 'Without-pay' then 'Never-worked' else workclass
end as workclass_new,
capitalgain - capitalloss as capitalgainloss
from fullData
")
prepInputFile2$cntrymap <- as.factor(prepInputFile2$cntrymap)
prepInputFile2$workclass_new <- as.factor(prepInputFile2$workclass_new)
prepInputFile2$relationship_new <- as.factor(prepInputFile2$relationship_new)
dropColNames = c('education','capitalgain','capitalloss', 'workclass','relationship','nativecountry')
prepInputFile2 <- prepInputFile2[ , !(names(prepInputFile2) %in% dropColNames)]
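The setup code stops before the 70/30 split that produces adultTraining. One way such a split might be done is with caret's createDataPartition(); the sketch below runs on a small synthetic data frame (dat, its columns, the seed and the name adultTesting are all assumptions) so that it is runnable on its own. With the real data, dat would be prepInputFile2.

```r
library(caret)

set.seed(123)  # arbitrary seed, only so the sketch is repeatable

# Synthetic stand-in so the sketch runs by itself; values are invented.
dat <- data.frame(
  age      = sample(18:90, 200, replace = TRUE),
  response = factor(sample(c("<=50K", ">50K"), 200, replace = TRUE,
                           prob = c(0.75, 0.25)))
)

# Stratified 70/30 split on the outcome, so both parts keep a similar class balance.
inTrain       <- createDataPartition(dat$response, p = 0.7, list = FALSE)
adultTraining <- dat[inTrain, ]
adultTesting  <- dat[-inTrain, ]  # hypothetical name for the held-out 30%

print(nrow(adultTraining))
print(nrow(adultTesting))
```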