caret - setting the seeds inside gafsControl()

Time: 2015-09-10 06:33:25

Tags: r-caret

I am trying to set the seeds inside caret's gafsControl(), but I am getting this error:

Error in { : task 1 failed - "supplied seed is not a valid integer"

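I understand that seeds for trainControl() is a list whose length equals the number of resamples plus one, and that each resample entry holds as many integers as there are combinations of the model's tuning parameters (36 in my case: an SVM with 6 sigma and 6 cost values). However, I can't figure out what I should use for gafsControl(). I have tried iters * popSize (100 * 10), iters (100), and popSize (10), but none of them worked.

Thanks in advance.

Here is my code (with simulated data):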

library(caret)
library(doMC)
library(kernlab)

registerDoMC(cores=32)

set.seed(1234)
train.set <- twoClassSim(300, noiseVars = 100, corrVar = 100, corrValue = 0.75)

mylogGA <- caretGA
mylogGA$fitness_extern <- mnLogLoss

#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)

#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- vector(mode = "list", length = 4)
for(i in 1:3) ga_seeds[[i]] <- sample.int(1500, 1000)

## For the last model:
ga_seeds[[4]] <- sample.int(1000, 1)

#Index for the trainControl()
set.seed(1045481)
tr_index <- createFolds(train.set$Class, k=5)

#Seeds for the trainControl()
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)
for(i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)

## For the last model:
tr_seeds[[6]] <- sample.int(1000, 1)


gaCtrl <- gafsControl(functions = mylogGA,
                      method = "cv",
                      number = 3,
                      metric = c(internal = "logLoss",
                                 external = "logLoss"),
                      verbose = TRUE,
                      maximize = c(internal = FALSE,
                                   external = FALSE),
                      index = ga_index,
                      seeds = ga_seeds,
                      allowParallel = TRUE)

tCtrl = trainControl(method = "cv", 
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = mnLogLoss,
                     index = tr_index,
                     seeds = tr_seeds,
                     allowParallel = FALSE)


svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5))

t1 <- Sys.time()
set.seed(1234235)
svmFuser.gafs <- gafs(x = train.set[, names(train.set) != "Class"],
                      y = train.set$Class,
                      gafsControl = gaCtrl,
                      trControl = tCtrl,
                      popSize = 10,
                      iters = 100,
                      method = "svmRadial",
                      preProc = c("center", "scale"),
                      tuneGrid = svmGrid,
                      metric="logLoss",
                      maximize = FALSE)

t2<- Sys.time()
svmFuser.gafs.time<-difftime(t2,t1)

save(svmFuser.gafs, file ="svmFuser.gafs.rda")
save(svmFuser.gafs.time, file ="svmFuser.gafs.time.rda")

Session info:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
 [10] LC_TELEPHONE=C            LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kernlab_0.9-22  doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   caret_6.0-52    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0         magrittr_1.5        splines_3.2.2        MASS_7.3-43         munsell_0.4.2      
 [6] colorspace_1.2-6    foreach_1.4.2       minqa_1.2.4         car_2.0-26          stringr_1.0.0      
 [11] plyr_1.8.3          tools_3.2.2         parallel_3.2.2      pbkrtest_0.4-2      nnet_7.3-10        
 [16] grid_3.2.2          gtable_0.1.2        nlme_3.1-122        mgcv_1.8-7          quantreg_5.18      
 [21] MatrixModels_0.4-1  iterators_1.0.7     gtools_3.5.0        lme4_1.1-9          digest_0.6.8       
 [26] Matrix_1.2-2        nloptr_1.0.4        reshape2_1.4.1      codetools_0.2-11    stringi_0.5-5      
 [31] compiler_3.2.2      BradleyTerry2_1.0-6 scales_0.3.0        stats4_3.2.2        SparseM_1.7        
 [36] brglm_0.5-9         proto_0.3-10       
> 

2 answers:

Answer 0 (score: 3)

I am not very familiar with the gafsControl() function you mention, but I ran into a very similar issue when setting parallel seeds with trainControl(). The documentation describes creating a list (length = number of resamples + 1) in which each item is a list (length = number of parameter combinations to test). I found that doing this does not work (see topepo/caret issue #248 for details). However, if you make each item a vector instead, the seeds seem to work (i.e. models and predictions are fully reproducible). I should clarify that this is with doMC as the backend. It may be different for other parallel backends.
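
The difference between the two shapes can be sketched in base R roughly as follows. This is only an illustration: the counts (5 resamples, 36 parameter combinations) are borrowed from the question, and the variable names `seeds_as_lists` and `seeds_as_vectors` are made up here.

```r
n_resamples <- 5   # assumed: 5-fold CV, as in the question
n_combos    <- 36  # assumed: 6 sigma x 6 cost values

set.seed(1056)

# Shape reported NOT to work: each item is itself a list.
seeds_as_lists <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) {
  seeds_as_lists[[i]] <- as.list(sample.int(1000, n_combos))
}
seeds_as_lists[[n_resamples + 1]] <- list(sample.int(1000, 1))

# Shape reported to work: each item is a plain integer vector.
seeds_as_vectors <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) {
  seeds_as_vectors[[i]] <- sample.int(1000, n_combos)
}
seeds_as_vectors[[n_resamples + 1]] <- sample.int(1000, 1)

# Inspect the two structures.
is.list(seeds_as_lists[[1]])
is.integer(seeds_as_vectors[[1]])
lengths(seeds_as_vectors)
```

Only the second structure would then be passed as `seeds` to trainControl(), if the observation above holds for your backend.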

Hope this helps.

Answer 1 (score: 2)

I found my mistake by inspecting gafs.default. The seeds argument of gafsControl() takes a vector of length (n_repeats*nresampling)+1, not a list (as in trainControl$seeds). In fact, the documentation in ?gafsControl states: "seeds is a vector or integers that can be used to set the seed during each search. The number of seeds must be equal to the number of resamples plus one." Lesson learned: a reminder to read the documentation carefully :D.

    if (!is.null(gafsControl$seeds)) {
        if (length(gafsControl$seeds) < length(gafsControl$index) + 
            1) 
            stop(paste("There must be at least", length(gafsControl$index) + 
            1, "random number seeds passed to gafsControl"))
    }
    else {
        gafsControl$seeds <- sample.int(1e+05, length(gafsControl$index) + 
        1)
    }

So the right way to set ga_seeds is:

#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)

#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- sample.int(1500, 4)
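
A quick check along these lines can catch the mismatch before launching a long run. This is a minimal sketch in base R: `check_gafs_seeds` is a hypothetical helper (not part of caret) that mirrors the length test from gafs.default quoted above and additionally rejects lists, and `ga_index_demo` is a made-up stand-in for the output of createFolds(..., k = 3).

```r
# Hypothetical helper, not part of caret: returns "ok" or a description
# of what looks wrong with the seeds for a given resampling index.
check_gafs_seeds <- function(seeds, index) {
  needed <- length(index) + 1
  if (is.list(seeds)) {
    return("seeds is a list; gafsControl() expects a plain integer vector")
  }
  if (!is.numeric(seeds) || anyNA(seeds) || any(seeds != round(seeds))) {
    return("seeds must be whole numbers without NA")
  }
  if (length(seeds) < needed) {
    return(paste("need at least", needed, "seeds, got", length(seeds)))
  }
  "ok"
}

# Stand-in for createFolds(train.set$Class, k = 3): three resamples.
ga_index_demo <- list(Fold1 = 1:100, Fold2 = 101:200, Fold3 = 201:300)

# Vector of 4 seeds, as in the corrected code.
set.seed(1056)
good_seeds <- sample.int(1500, 4)

# List-shaped seeds, as in the question.
bad_seeds <- vector(mode = "list", length = 4)
for (i in 1:3) bad_seeds[[i]] <- sample.int(1500, 1000)
bad_seeds[[4]] <- sample.int(1000, 1)

check_gafs_seeds(good_seeds, ga_index_demo)
check_gafs_seeds(bad_seeds, ga_index_demo)
```

The helper only inspects the shape of the seeds; it does not call gafs() or verify reproducibility.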