When I run two random forests in caret, I get exactly the same results if I set a random seed:
library(caret)
library(doParallel)
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
However, if I register a parallel back-end to speed up the modeling, I get a different result each time I run the model:
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)
> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01813729"
[2] "Component 3: Mean relative difference: 0.02271638"
Is there any way to fix this problem? One suggestion was to use the doRNG package, but train uses nested loops, which are not currently supported:
library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
registerDoRNG()
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
> model1 <- train(Species~., iris, method='rf', trControl=myControl)
Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter", :
nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.
Update:
I thought this problem could be solved with doSNOW and clusterSetupRNG, but I couldn't quite get there.
set.seed(42)
library(caret)
library(doSNOW)
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
clusterSetupRNG(cl, seed=rep(12345,6))
a <- clusterCall(cl, runif, 10000)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
clusterSetupRNG(cl, seed=rep(12345,6))
b <- clusterCall(cl, runif, 10000)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
all.equal(a, b)
[1] TRUE
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01890339"
[2] "Component 3: Mean relative difference: 0.01656751"
stopCluster(cl)
What is special about foreach, and why doesn't it use the seeds I set up on the cluster? Objects a and b are identical, so why aren't model1 and model2?
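Answer 0: (score: 46)
One easy way to run fully reproducible models in parallel mode with the caret package is to pass the seeds argument when calling trainControl. The code below resolves the question above; see the trainControl help page for more information.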
library(doParallel); library(caret)
#create a list of seeds, with a different seed vector for each resample
set.seed(123)
#list length = (n_repeats * n_resamples) + 1; createFolds() gives 10 folds by default
seeds <- vector(mode = "list", length = 11)
#3 = number of tuning-parameter values evaluated (mtry for rf, caret's default tuneLength; here also equal to ncol(iris)-2)
for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 3)
#a single seed for the final model
seeds[[11]] <- sample.int(1000, 1)
#control list
myControl <- trainControl(method='cv', seeds=seeds, index=createFolds(iris$Species))
#run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)
#compare
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
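For other resampling schemes or tuning grids, the seed list has to change shape accordingly. Here is a minimal sketch that wraps the construction above in a helper. The name make_seeds and its arguments are hypothetical, not part of caret. It assumes the layout described on the trainControl help page: one integer vector per resample, with one seed per candidate model, plus a single integer for the final fit.

```r
# Hypothetical helper (not a caret function): build a `seeds` list for trainControl().
#   n_resamples: number of resamples (folds * repeats for repeated CV)
#   n_tune:      number of tuning-parameter combinations train() will evaluate
#   seed:        master seed, so the list itself is reproducible
make_seeds <- function(n_resamples, n_tune, seed = 123) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  # one vector of n_tune seeds per resample
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, size = n_tune)
  }
  # a single seed for the final model fit on the full data
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

# Same shape as the list built by hand above: 10 folds, 3 mtry values
seeds <- make_seeds(n_resamples = 10, n_tune = 3)
```

The result can be passed as trainControl(method='cv', seeds=seeds, ...) exactly like the hand-built list.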
Answer 1: (score: 8)
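So caret uses the foreach package to parallelize. There is most likely a way to set the seed at each iteration, but we would need to set up more options in train.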
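To illustrate what seeding each iteration buys you, here is a small base R sketch. It is not caret's internal code, and the names task_seeds and run_task are made up for the example. Because every task reseeds the generator before it draws, its result depends only on its own seed, not on which worker runs it or in what order.

```r
# One seed per task, fixed up front (the values themselves are arbitrary).
set.seed(123)
task_seeds <- sample.int(1000, 5)

# Each task reseeds before drawing, so its output depends only on its own seed.
run_task <- function(i) {
  set.seed(task_seeds[i])
  runif(3)
}

# Run the tasks in order, then in reverse order (a stand-in for the
# unpredictable scheduling of parallel workers), and compare per-task results.
in_order <- lapply(1:5, run_task)
reversed <- rev(lapply(5:1, run_task))
identical(in_order, reversed)
```

A seed set once on the master, or once per worker, gives no such guarantee, because the number of draws consumed before a given task starts depends on how the tasks were scheduled.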
Alternatively, you could create a custom modeling function that mimics the internal one for random forests and sets the seed itself.
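A rough sketch of that idea follows, assuming the custom-model list layout that train accepts (library, type, parameters, grid, fit, predict, prob, sort, levels). The factory name make_rf_seeded and its seed argument are hypothetical, and the grid is a simplified stand-in for caret's own mtry grid rather than a copy of it.

```r
# Hypothetical factory (not part of caret): returns a custom model list whose
# fit step sets the seed itself. The seed lives in the closure, so it travels
# with the functions when they are sent to parallel workers.
make_rf_seeded <- function(seed = 42) {
  force(seed)
  list(
    library = "randomForest",
    type = "Classification",
    parameters = data.frame(parameter = "mtry",
                            class = "numeric",
                            label = "#Randomly Selected Predictors"),
    # simplified grid: `len` evenly spaced mtry values between 2 and ncol(x)
    grid = function(x, y, len = NULL, search = "grid") {
      data.frame(mtry = unique(round(seq(2, ncol(x), length.out = len))))
    },
    fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
      set.seed(seed)  # seed set right before the forest is grown
      randomForest::randomForest(x, y, mtry = min(param$mtry, ncol(x)), ...)
    },
    predict = function(modelFit, newdata, preProc = NULL, submodels = NULL)
      predict(modelFit, newdata),
    prob = function(modelFit, newdata, preProc = NULL, submodels = NULL)
      predict(modelFit, newdata, type = "prob"),
    sort = function(x) x[order(x$mtry), , drop = FALSE],
    levels = function(x) x$classes
  )
}

rf_seeded <- make_rf_seeded(42)
```

The list would then be passed in place of the method name, e.g. train(Species~., iris, method=rf_seeded, trControl=myControl). Note that every fit starts from the same seed, so the results should be reproducible, but the forests in different resamples share one random-number stream.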
Max