Missing-value error when using bagImpute preprocessing in the caret::train function

Time: 2016-09-21 14:51:44

Tags: r machine-learning random-forest r-caret cross-validation

I want to train a random forest model with the caret::train function using repeatedcv. My data has some missing values, so I want to use the preProcess="bagImpute" option in the train function. I do not want to use the preProcess function outside of train, because I want to bagImpute within each iteration of the repeatedcv process. However, when I try to do so, it throws an error:

Error in { : task 1 failed - "'n' must be a positive integer >= 'x'"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Fold01.Rep01: mtry=2 Error in na.fail.default(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, :
  missing values in object

library(caret)
library(rpart)  # provides rpart.control(), used in the train() call below

data(iris)
inTrain <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[inTrain, ]

# Set one randomly chosen predictor to NA in about 10% of the rows
fillInNa <- function(d) {
  naCount <- NROW(d) * 0.1
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA
  }
  return(d)
}

training <- fillInNa(training)

tc <- trainControl("repeatedcv", repeats = 30, selectionFunction = "oneSE",
                   returnData = TRUE, classProbs = TRUE, number = 10,
                   preProcOptions = "bagImpute",
                   summaryFunction = multiClassSummary, savePredictions = TRUE)

training.x <- training[, 1:4]
training.y <- training[, 5]

rfTri_Bag <- train(training.x, training.y,
                   method = "rf",
                   trControl = tc,
                   preProcess = c("bagImpute"),
                   tuneLength = 10,
                   control = rpart.control(usesurrogate = 0),
                   ntree = 250,
                   proximity = TRUE)

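The code above is a minimal reproducible example using the iris data. I borrowed the initial dataset-preparation code from Minkoo's website, http://mkseo.pe.kr/stats/?p=719. Many thanks, Minkoo!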

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_UnitedStates.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ipred_0.9-5         e1071_1.6-7         latticeExtra_0.6-28 RColorBrewer_1.1-2  randomForest_4.6-12 caret_6.0-71       
 [7] rpart_4.1-10        party_1.0-25        strucchange_1.5-1   sandwich_2.3-4      zoo_1.7-13          modeltools_0.2-21  
[13] mvtnorm_1.0-5       gdata_2.17.0        DMwR_0.4.1          pROC_1.8            Metrics_0.1.1       raster_2.5-8       
[19] sp_1.2-3            gridExtra_2.2.1     readr_1.0.0         tidyr_0.6.0         tibble_1.2          tidyverse_1.0.0    
[25] MuMIn_1.15.6        merTools_0.2.2      devtools_1.12.0     plyr_1.8.4          arm_1.9-1           lattice_0.20-33    
[31] MASS_7.3-45         xtable_1.8-2        lmerTest_2.0-32     lme4_1.1-12         Matrix_1.2-6        xlsx_0.5.7         
[37] xlsxjars_0.6.1      rJava_0.9-8         AICcmodavg_2.0-4    pander_0.6.0        ggplot2_2.1.0       purrr_0.2.2        
[43] dplyr_0.5.0         broom_0.4.1        

loaded via a namespace (and not attached):
 [1] TH.data_1.0-7      VGAM_1.0-2         minqa_1.2.4        colorspace_1.2-6   class_7.3-14       MatrixModels_0.4-1
 [7] DT_0.2             prodlim_1.5.7      coin_1.1-2         codetools_0.2-14   splines_3.3.1      mnormt_1.5-4      
[13] knitr_1.14         Formula_1.2-1      nloptr_1.0.4       pbkrtest_0.4-6     cluster_2.0.4      shiny_0.14        
[19] compiler_3.3.1     httr_1.2.1         assertthat_0.1     lazyeval_0.2.0     acepack_1.3-3.3    htmltools_0.3.5   
[25] quantreg_5.29      tools_3.3.1        coda_0.18-1        gtable_0.2.0       reshape2_1.4.1     Rcpp_0.12.7       
[31] nlme_3.1-128       iterators_1.0.8    psych_1.6.6        stringr_1.1.0      mime_0.5           gtools_3.5.0      
[37] scales_0.4.0       parallel_3.3.1     SparseM_1.7        yaml_2.1.13        quantmod_0.4-6     curl_1.2          
[43] memoise_1.0.0      reshape_0.8.5      stringi_1.1.1      foreach_1.4.3      blme_1.0-4         TTR_0.23-1        
[49] caTools_1.17.1     boot_1.3-18        lava_1.4.4         chron_2.3-47       bitops_1.0-6       evaluate_0.9      
[55] ROCR_1.0-7         htmlwidgets_0.7    labeling_0.3       magrittr_1.5       R6_2.1.3           gplots_3.0.1      
[61] Hmisc_3.17-4       multcomp_1.4-6     DBI_0.5            foreign_0.8-66     withr_1.0.2        mgcv_1.8-12       
[67] xts_0.9-7          survival_2.39-4    abind_1.4-5        nnet_7.3-12        car_2.1-3          KernSmooth_2.23-15
[73] rmarkdown_1.0      data.table_1.9.6   git2r_0.15.0       digest_0.6.10      httpuv_1.3.3       munsell_0.4.3     
[79] unmarked_0.11-0   

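Edit: My session info is shown above.

Edit 2: A nearly identical question was asked here: https://stackoverflow.com/a/20081954/5617640, but the answer given simply shows how to predict from a train() object using the preProcess() function. As @Misconstruction points out in the comments, "With this approach the imputation is not included in the CV loop." - my thoughts exactly.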
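To make "imputation inside the CV loop" concrete, here is a rough sketch of doing it by hand for a single round of 5-fold CV. It is not a fix for the train() error above. The names x, y, folds, imputer and fold_accuracy are made up for illustration, and the fold count and seed are arbitrary assumptions. The imputation model is fitted on the training rows of each fold only and then applied to the held-out rows.

```r
library(caret)         # createFolds(), preProcess()
library(randomForest)  # randomForest()

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Iris predictors with one missing value in 15 distinct rows (10% of the data),
# similar in spirit to fillInNa() in the question.
x <- iris[, 1:4]
y <- iris$Species
for (i in sample(nrow(x), 15)) {
  x[i, sample(4, 1)] <- NA
}

# Training-row indices for each of 5 folds.
folds <- createFolds(y, k = 5, returnTrain = TRUE)

fold_accuracy <- sapply(folds, function(train_idx) {
  x_train <- x[train_idx, ]
  x_test  <- x[-train_idx, ]

  # Fit the bagged-tree imputation on the training rows only ...
  imputer <- preProcess(x_train, method = "bagImpute")

  # ... then apply it to both the training rows and the held-out rows.
  x_train_imp <- predict(imputer, x_train)
  x_test_imp  <- predict(imputer, x_test)

  fit  <- randomForest(x_train_imp, y[train_idx], ntree = 250)
  pred <- predict(fit, x_test_imp)
  mean(pred == y[-train_idx])  # accuracy on the held-out fold
})

mean(fold_accuracy)
```

This only covers one repeat and a default mtry; repeating it and tuning mtry would need an outer loop around the same idea.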

1 Answer:

Answer 0 (score: 0)

This is not a solution to the error message, but hopefully it addresses your problem.

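If you are running a random forest model, it inherently "cross-validates itself" in the sense that it produces an out-of-bag (OOB) error estimate. When using random forests there is no need for any kind of cross-validation, as this Berkeley article states:

"In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run..." (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)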
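As a small illustration of the OOB estimate mentioned above, the sketch below fits a forest with the randomForest package and reads the out-of-bag error off the fitted object. It assumes the complete iris data (no missing values), since randomForest() rejects missing predictors by default. The seed and the variable names fit and oob_error are arbitrary.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Complete iris data: no NAs, so no imputation step is involved here.
fit <- randomForest(Species ~ ., data = iris, ntree = 250)

# err.rate has one row per tree; the "OOB" column of the last row is the
# out-of-bag misclassification rate of the full forest.
oob_error <- fit$err.rate[fit$ntree, "OOB"]
oob_error
```

Printing fit also shows the OOB error estimate together with the confusion matrix.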