caret train() with svmRadial and kNN takes too long - is there a way to improve train() performance?

Asked: 2017-09-18 01:02:24

Tags: r svm r-caret knn

I have a script that uses caret functions to train and compare several models. I noticed that svmRadial and kNN take a long time to run on the dataset I chose.

Dataset: the original is the Census Income (Adult) dataset, but I reduced it, and the resulting data looks like this:

'data.frame':   32561 obs. of  13 variables:
 $ age             : int  39 50 38 53 28 37 49 52 31 42 ...
 $ fnlwgt          : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ educationnum    : int  13 13 9 7 13 14 5 9 14 13 ...
 $ maritalstatus   : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation      : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
 $ race            : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex             : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ hoursperweek    : int  40 13 40 40 40 40 16 45 50 40 ...
 $ response        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
 $ cntrymap        : Factor w/ 9 levels "British-Commonwealth",..: 9 9 9 9 6 9 5 9 9 9 ...
 $ relationship_new: Factor w/ 5 levels "Not-in-family",..: 1 4 1 4 4 4 1 4 1 4 ...
 $ workclass_new   : Factor w/ 8 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
 $ capitalgainloss : int  2174 0 0 0 0 0 0 0 14084 5178 ...

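The dataset still contains many factors, and I'm not sure whether they should be preprocessed or re-engineered to help the performance of the train() call.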
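To make the concern about factors concrete, here is a rough sketch of how many predictor columns the formula interface (`response ~ .`) would produce. It assumes default treatment contrasts, under which a k-level factor becomes k - 1 dummy columns; the level counts are copied from the str() output above, and the `toy` data frame is a hypothetical stand-in used only to check the rule.

```r
# Level counts copied from the str() output above (response excluded).
factor_levels <- c(maritalstatus = 7, occupation = 15, race = 5, sex = 2,
                   cntrymap = 9, relationship_new = 5, workclass_new = 8)
n_numeric <- 5  # age, fnlwgt, educationnum, hoursperweek, capitalgainloss

# ASSUMPTION: default treatment contrasts, so k levels -> k - 1 dummy columns.
n_dummy <- sum(factor_levels - 1)
n_predictors <- n_numeric + n_dummy

# Toy check of the k - 1 rule with model.matrix() (hypothetical data).
toy <- data.frame(x = c(1, 2, 3, 4),
                  f = factor(c("a", "b", "c", "a")))
toy_cols <- ncol(model.matrix(~ ., data = toy)) - 1  # drop the intercept column

cat("dummy columns:", n_dummy, "\n")
cat("total predictors:", n_predictors, "\n")
```

Under these assumptions the 12 predictors expand to roughly 49 columns, most of them sparse 0/1 indicators, which both kNN distance computations and the radial kernel have to process for every pair of rows.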

Running kNN on the 70/30 split still takes quite a long time, though not longer than svmRadial:

adult.kNN <- train(response~., data=adultFile, method="knn", metric=metric, preProc=preProc, trControl=control, tuneLength=10)

adult.svmRadial <- train(response~., data=adultTraining, method="svmRadial", metric=metric, preProc=c("center", "scale"), 
                          trControl=control, tuneLength = 5)
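A large share of the run time comes from how many models train() has to fit. The sketch below estimates that count. The `control` object is not shown in the question, so the resampling scheme here (10-fold cross-validation repeated 3 times) is purely an assumption, as is the helper name `count_fits`; it also assumes that tuneLength maps directly to the number of candidate settings (10 values of k for knn, 5 values of C for svmRadial).

```r
# ASSUMPTION: `control` is not shown in the question; this guesses
# trainControl(method = "repeatedcv", number = 10, repeats = 3).
n_folds <- 10
n_repeats <- 3
n_resamples <- n_folds * n_repeats

# Hypothetical helper: one fit per candidate per resample,
# plus one final fit on the full training set.
count_fits <- function(n_candidates, n_resamples) {
  n_candidates * n_resamples + 1
}

knn_fits <- count_fits(n_candidates = 10, n_resamples = n_resamples)  # tuneLength = 10
svm_fits <- count_fits(n_candidates = 5, n_resamples = n_resamples)   # tuneLength = 5

cat("kNN fits:", knn_fits, "\n")
cat("svmRadial fits:", svm_fits, "\n")
```

If the real `control` resembles this guess, reducing the number of resamples or the tuneLength shrinks the work proportionally. caret's trainControl() also has an allowParallel argument, which takes effect when a foreach backend such as doParallel is registered.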

Here is my code for setting up the data, so the problem can be reproduced:

library(sqldf)  # provides sqldf(), used below to derive the new columns

data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
download.file(url = data_url, destfile = "adult.data")
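# stringsAsFactors = TRUE reproduces the Factor columns shown in str() above on R >= 4.0
fullData <- read.csv("adult.data", sep = ",", header = FALSE, strip.white = TRUE, stringsAsFactors = TRUE)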
#fullData <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header=F,strip.white=TRUE)
names(fullData) <- c("age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation", "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry", "response")

prepInputFile2 <- sqldf("select *,
      case 
        when   nativecountry == 'United-States' then 'United-States'
        when   nativecountry == 'China'    OR nativecountry == 'Hong' OR nativecountry == 'Taiwan' then 'China'
        when   nativecountry == 'Cambodia' OR nativecountry == 'Laos' OR nativecountry == 'Philippines' 
            OR nativecountry == 'Thailand' OR nativecountry == 'Vietnam' then 'SoutEast-Asia'
        when   nativecountry == 'Canada'   OR nativecountry == 'England' OR nativecountry == 'India' OR nativecountry == 'Ireland'
            OR nativecountry == 'Scotland' then 'British-Commonwealth'
        when   nativecountry == 'Columbia' OR nativecountry == 'El-Salvador' OR nativecountry == 'Ecuador' OR nativecountry == 'Peru' 
        then 'South-America'
        when   nativecountry == 'Dominican-Republic' OR nativecountry == 'Guatemala' OR nativecountry == 'Haiti'
            OR nativecountry == 'Honduras' OR nativecountry == 'Jamaica' OR nativecountry == 'Mexico' OR nativecountry =='Nicaragua'
            OR nativecountry == 'Outlying-US(Guam-USVI-etc)' OR nativecountry == 'Puerto-Rico' OR nativecountry =='Trinadad&Tobago' 
        then 'Latin-America'
        when   nativecountry == 'France' OR nativecountry == 'Germany' OR nativecountry == 'Holand-Netherlands' 
            OR nativecountry == 'Italy' then 'Euro-1'
        when   nativecountry == 'Yugoslavia' OR nativecountry == 'Greece' OR nativecountry == 'Hungary' OR nativecountry == 'Poland'
            OR nativecountry == 'Portugal'   OR nativecountry == 'South' then 'Euro-2'
        when   nativecountry == 'Cuba'       OR nativecountry == 'Iran' OR nativecountry == 'Japan' OR nativecountry == '?' then 'Other'
        else 'Undetermined'
      end as cntrymap,
      case
        when relationship == 'Husband' OR relationship == 'Wife' then 'Spouse' else relationship
      end as relationship_new,
      case 
        when workclass == 'Without-pay' then 'Never-worked' else workclass
      end as workclass_new,
      capitalgain - capitalloss as capitalgainloss
    from fullData
      ")
prepInputFile2$cntrymap <- as.factor(prepInputFile2$cntrymap)
prepInputFile2$workclass_new <- as.factor(prepInputFile2$workclass_new)
prepInputFile2$relationship_new <- as.factor(prepInputFile2$relationship_new)
dropColNames = c('education','capitalgain','capitalloss', 'workclass','relationship','nativecountry')
prepInputFile2 <- prepInputFile2[ , !(names(prepInputFile2) %in% dropColNames)]
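The 70/30 split mentioned earlier is not part of the code above. A minimal base R sketch of a class-stratified split follows. The `toy` data frame is a hypothetical stand-in for prepInputFile2, `adultTest` is a made-up name for the hold-out set, and the seed is arbitrary. caret's createDataPartition() offers a similar stratified split, but whether the original script used it is not stated.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Toy stand-in for prepInputFile2 (hypothetical values).
toy <- data.frame(
  age = 21:40,
  response = factor(rep(c("<=50K", ">50K"), times = c(14, 6)))
)

# Sample 70% of the row indices within each response class,
# so both classes keep roughly the same proportion in each part.
idx_by_class <- split(seq_len(nrow(toy)), toy$response)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

adultTraining <- toy[train_idx, ]
adultTest <- toy[-train_idx, ]  # hypothetical name for the 30% hold-out

cat("training rows:", nrow(adultTraining), "\n")
cat("hold-out rows:", nrow(adultTest), "\n")
```

The same indexing approach can draw a smaller stratified subsample for timing a train() call before committing to the full training set.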

0 answers:

No answers yet