I am trying to set seeds inside caret's gafsControl(), but I get this error:
Error in { : task 1 failed - "supplied seed is not a valid integer"
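I understand that seeds for trainControl() is a vector (a list) whose length equals the number of resamples plus 1, and that each resample entry holds as many seeds as there are tuning-parameter combinations (36 in my case: an SVM with 6 sigma values and 6 cost values). However, I cannot figure out what I should use for gafsControl(). I have tried iters * popSize (100 * 10), iters (100), and popSize (10), but none of them worked.
Thanks in advance.
Here is my code (with simulated data):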
library(caret)
library(doMC)
library(kernlab)
registerDoMC(cores=32)
set.seed(1234)
train.set <- twoClassSim(300, noiseVars = 100, corrVar = 100, corrValue = 0.75)
mylogGA <- caretGA
mylogGA$fitness_extern <- mnLogLoss
#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)
#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- vector(mode = "list", length = 4)
for(i in 1:3) ga_seeds[[i]] <- sample.int(1500, 1000)
## For the last model:
ga_seeds[[4]] <- sample.int(1000, 1)
#Index for the trainControl()
set.seed(1045481)
tr_index <- createFolds(train.set$Class, k=5)
#Seeds for the trainControl()
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)
for(i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)
## For the last model:
tr_seeds[[6]] <- sample.int(1000, 1)
gaCtrl <- gafsControl(functions = mylogGA,
                      method = "cv",
                      number = 3,
                      metric = c(internal = "logLoss",
                                 external = "logLoss"),
                      verbose = TRUE,
                      maximize = c(internal = FALSE,
                                   external = FALSE),
                      index = ga_index,
                      seeds = ga_seeds,
                      allowParallel = TRUE)
tCtrl <- trainControl(method = "cv",
                      number = 5,
                      classProbs = TRUE,
                      summaryFunction = mnLogLoss,
                      index = tr_index,
                      seeds = tr_seeds,
                      allowParallel = FALSE)
svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5))
t1 <- Sys.time()
set.seed(1234235)
svmFuser.gafs <- gafs(x = train.set[, names(train.set) != "Class"],
                      y = train.set$Class,
                      gafsControl = gaCtrl,
                      trControl = tCtrl,
                      popSize = 10,
                      iters = 100,
                      method = "svmRadial",
                      preProc = c("center", "scale"),
                      tuneGrid = svmGrid,
                      metric = "logLoss",
                      maximize = FALSE)
t2<- Sys.time()
svmFuser.gafs.time<-difftime(t2,t1)
save(svmFuser.gafs, file ="svmFuser.gafs.rda")
save(svmFuser.gafs.time, file ="svmFuser.gafs.time.rda")
Session info:
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kernlab_0.9-22 doMC_1.3.3 iterators_1.0.7 foreach_1.4.2 caret_6.0-52 ggplot2_1.0.1 lattice_0.20-33
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 magrittr_1.5 splines_3.2.2 MASS_7.3-43 munsell_0.4.2
[6] colorspace_1.2-6 foreach_1.4.2 minqa_1.2.4 car_2.0-26 stringr_1.0.0
[11] plyr_1.8.3 tools_3.2.2 parallel_3.2.2 pbkrtest_0.4-2 nnet_7.3-10
[16] grid_3.2.2 gtable_0.1.2 nlme_3.1-122 mgcv_1.8-7 quantreg_5.18
[21] MatrixModels_0.4-1 iterators_1.0.7 gtools_3.5.0 lme4_1.1-9 digest_0.6.8
[26] Matrix_1.2-2 nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-11 stringi_0.5-5
[31] compiler_3.2.2 BradleyTerry2_1.0-6 scales_0.3.0 stats4_3.2.2 SparseM_1.7
[36] brglm_0.5-9 proto_0.3-10
>
Answer 0 (score: 3)
I am not very familiar with the gafsControl() function you mention, but I ran into a very similar problem when setting parallel seeds with trainControl(). The documentation describes how to create a list (length = number of resamples + 1) in which each item is itself a list (length = number of parameter combinations to test). I found that doing it that way does not work (see topepo/caret issue #248 for details). However, if you convert each item into a vector, the seeds seem to work (that is, the models and predictions are fully reproducible). I should clarify that this was with doMC as the backend. It may differ for other parallel backends.
Hope this helps.
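The conversion this answer describes can be sketched in base R as follows. This is only an illustration, not the answerer's original snippet: the names `n_resamples`, `n_tune`, `seeds_nested` and `tr_seeds_flat` are made up for the example, and the sizes (5 resamples, 36 tuning combinations) are taken from the question.

```r
n_resamples <- 5    # 5-fold CV, as in the question's trainControl()
n_tune      <- 36   # 6 sigma values x 6 C values

set.seed(1056)

# Assumed starting point: a list of lists, one inner list per resample
seeds_nested <- lapply(seq_len(n_resamples),
                       function(i) as.list(sample.int(1000, n_tune)))
# One extra entry with a single seed for the final model
seeds_nested[[n_resamples + 1]] <- as.list(sample.int(1000, 1))

# Flatten every inner list into a plain integer vector
tr_seeds_flat <- lapply(seeds_nested, unlist)

str(tr_seeds_flat[[1]])   # an integer vector of length 36
```

The resulting `tr_seeds_flat` has the same shape as the `tr_seeds` object built in the question, so it could be passed to `trainControl(seeds = ...)` in the same way.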
Answer 1 (score: 2)
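I found my mistake by inspecting gafs.default. The seeds argument of gafsControl() takes a vector of length (n_repeats*nresampling)+1, not a list (as trainControl$seeds does). The documentation in ?gafsControl actually says so: "seeds is a vector or integers that can be used to set the seed during each search. The number of seeds must be equal to the number of resamples plus one."
I learned it the hard way; this is a reminder to read the documentation carefully :D. The relevant check in gafs.default is: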
if (!is.null(gafsControl$seeds)) {
    if (length(gafsControl$seeds) < length(gafsControl$index) + 1)
        stop(paste("There must be at least", length(gafsControl$index) + 1,
                   "random number seeds passed to gafsControl"))
} else {
    gafsControl$seeds <- sample.int(1e+05, length(gafsControl$index) + 1)
}
So the correct way to set ga_seeds is:
#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)
#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- sample.int(1500, 4)
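Before launching a long gafs() run, it can be useful to check the shape of the seeds up front. The sketch below is a minimal illustration in base R: `check_gafs_seeds` is a hypothetical helper (it is not part of caret), and the hand-built `ga_index` merely stands in for the output of createFolds(..., k = 3) so that the example runs without caret installed.

```r
# Hypothetical helper, not a caret function: TRUE when `seeds` is a plain
# numeric vector with at least one seed per resample plus one extra.
check_gafs_seeds <- function(seeds, index) {
  needed <- length(index) + 1
  is.numeric(seeds) && !is.list(seeds) && length(seeds) >= needed
}

# Stand-in for createFolds(train.set$Class, k = 3): one element per resample
ga_index <- list(Fold1 = 1:100, Fold2 = 101:200, Fold3 = 201:300)

set.seed(1056)
ga_seeds_vec  <- sample.int(1500, 4)                 # vector form from this answer
ga_seeds_list <- vector(mode = "list", length = 4)   # list form from the question

check_gafs_seeds(ga_seeds_vec, ga_index)    # TRUE
check_gafs_seeds(ga_seeds_list, ga_index)   # FALSE
```

The length test mirrors the one in the gafs.default excerpt above. The type test is an extra assumption added for this example, since the excerpt itself only checks the length.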