I have the following code, which uses random forest as the method. It is fully reproducible if you run it in parallel mode on the same machine:
library(doParallel)
library(caret)
recursive_feature_elimination <- function(dat){
  # predictors and response from the ChickWeight data
  all_preds <- dat[, which(names(dat) %in% c("Time", "Chick", "Diet"))]
  response <- dat[, which(names(dat) == "weight")]
  # subset sizes to evaluate; rfe() also evaluates the full predictor set
  sizes <- c(1:(ncol(all_preds)-1))
  # set seeds manually
  set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
  # seeds is a list with number*repeats + 1 = 3*5 + 1 = 16 elements.
  # Each of the first 15 elements is an integer vector of length(sizes)+1
  # (one seed per subset size, plus one for the full set).
  seeds <- vector(mode = "list", length = 16)
  for(i in 1:15) seeds[[i]] <- sample.int(n = 1000, size = length(sizes)+1)
  # the last element is a single seed for the final model
  seeds[[16]] <- sample.int(1000, 1)
  # note: seeds_list is not used below; only `seeds` is passed to rfeControl()
  seeds_list <- list(rfe_seeds = seeds,
                     train_seeds = NA)
  # specify rfeControl
  contr <- caret::rfeControl(functions = rfFuncs, method = "repeatedcv",
                             number = 3, repeats = 5,
                             saveDetails = TRUE, seeds = seeds,
                             allowParallel = TRUE)
  # recursive feature elimination caret
  results <- caret::rfe(x = all_preds,
                        y = response,
                        sizes = sizes,
                        method = "rf",
                        ntree = 250,
                        metric = "RMSE",
                        rfeControl = contr)
  return(results)
}
dat <- as.data.frame(ChickWeight)
cores <- detectCores()
cl <- makePSOCKcluster(cores, outfile="")
registerDoParallel(cl)
results <- recursive_feature_elimination(dat)
stopCluster(cl)
registerDoSEQ()
The results on my machine are:
Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
        1 39.14   0.6978 24.60  2.755    0.02908 1.697
        2 23.12   0.8998 13.90  2.675    0.02273 1.361        *
        3 28.18   0.8997 20.32  2.243    0.01915 1.225
The top 2 variables (out of 2):
Time, Chick
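I am using a Windows OS with one CPU and 4 cores. If the code is run on a UNIX OS using multiple CPUs with multiple cores, the results are different. I assume this behavior is caused by random number generation, which apparently differs between my system and the multi-CPU system. The same happens with train().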
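One thing that can be ruled out first is the RNG configuration itself. The `set.seed()` call above fixes `kind` and `normal.kind` but not `sample.kind`, and the default for `sample.kind` changed from "Rounding" to "Rejection" in R 3.6.0, so `sample.int()` can return different values for the same seed on machines running different R versions. Whether that is what happens here is an assumption, not something confirmed. A minimal sketch in base R, with `pin_rng` as a hypothetical helper and assuming R >= 3.6.0 on every machine:

```r
# Sketch (base R only): pin all three RNG settings before seeding.
# Assumption: every machine runs R >= 3.6.0, where `sample.kind` exists.
pin_rng <- function(seed) {
  set.seed(seed,
           kind = "Mersenne-Twister",
           normal.kind = "Inversion",
           sample.kind = "Rejection")
  # return the active configuration so it can be logged and compared
  invisible(RNGkind())
}

pin_rng(42)
draw_a <- sample.int(n = 1000, size = 3)
pin_rng(42)
draw_b <- sample.int(n = 1000, size = 3)

print(RNGkind())                  # compare this output across machines
print(identical(draw_a, draw_b))  # same seed and settings, same draws
```

Printing `RNGkind()` and `R.version.string` on both systems would show whether the two setups actually start from the same generator settings.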
How can I get fully reproducible results, independent of the operating system and of how many CPUs and cores are used for parallelization?
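A minimal sketch of one way to make results independent of the worker count, assuming every task can seed its own RNG right before it draws random numbers. This is the idea behind the `seeds` argument of `rfeControl()`, which caret documents as integers used to set the seed at each resampling iteration. `seeded_task`, `task_seeds`, and `run_with_workers` are hypothetical names, and the task is only a stand-in for one model fit, not caret code:

```r
library(parallel)

# Hypothetical task: seed locally, then draw. Stands in for one model fit.
seeded_task <- function(seed) {
  set.seed(seed, kind = "Mersenne-Twister", normal.kind = "Inversion",
           sample.kind = "Rejection")
  mean(rnorm(10))
}

# One fixed seed per task (illustrative values)
task_seeds <- c(101L, 202L, 303L, 404L, 505L, 606L)

# Run all tasks on a PSOCK cluster with the given number of workers
run_with_workers <- function(n_workers) {
  cl <- makePSOCKcluster(n_workers)
  on.exit(stopCluster(cl))
  unlist(parLapply(cl, task_seeds, seeded_task))
}

sequential    <- vapply(task_seeds, seeded_task, numeric(1))
two_workers   <- run_with_workers(2)
three_workers <- run_with_workers(3)

print(identical(sequential, two_workers))
print(identical(two_workers, three_workers))
```

Because each result depends only on its own seed, the scheduling order and the number of workers should not matter on a single machine. This does not by itself guarantee identical results across R versions or platforms.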
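How can I ensure that every internal process of rfe and randomForest uses the same random numbers, regardless of the order in which the parallel processes run?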
How are the random numbers generated for each parallel process?
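For context on this last question, a sketch of the mechanism that the base `parallel` package offers. Each PSOCK worker is a separate R process with its own RNG state, and `clusterSetRNGStream()` gives every worker its own L'Ecuyer-CMRG stream derived from one seed. `draw_one` and `draw_on_workers` are hypothetical helpers, and this is not a description of what caret or randomForest do internally:

```r
library(parallel)

# Hypothetical task: a single uniform draw on the worker
draw_one <- function() runif(1)

draw_on_workers <- function(n_workers, seed) {
  cl <- makePSOCKcluster(n_workers)
  on.exit(stopCluster(cl))
  # assign one L'Ecuyer-CMRG stream per worker, derived from `seed`
  clusterSetRNGStream(cl, iseed = seed)
  # clusterCall runs the function once on every worker, in worker order
  unlist(clusterCall(cl, draw_one))
}

first  <- draw_on_workers(2, seed = 42)
second <- draw_on_workers(2, seed = 42)

print(identical(first, second))  # same seed and worker count
print(first[1] != first[2])      # each worker has its own stream
```

With this approach the draws depend on which worker runs which task, so results can change when the number of workers changes.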