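I have a problem running randomForest in parallel using foreach. Following this example, I created some data and then a formula object. The formula itself works with randomForest, but it fails when used inside a foreach parallel loop. Why?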
# rf on big training set
# use parallel foreach
library(foreach)
library(doMC)
registerDoMC(4) # change 4 to your number of CPU cores
# info on parallel backend
getDoParName()
getDoParWorkers()
# bogus data
set.seed(123)
ssize <- 100000
x1 <- sample( LETTERS[1:9], ssize, replace=TRUE, prob=c(0.1, 0.2, 0.15, 0.05,0.1, 0.2, 0.05, 0.05,0.1) )
x2 <- rlnorm(ssize,0,0.25)
x3 <- rlnorm(ssize,0,0.5)
y <- sample( c("Y","N"), ssize, replace=TRUE, prob=c(0.05, 0.95))
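# stringsAsFactors = TRUE keeps x1 and y as factors on R >= 4.0.0,
# where the default changed to FALSE
df <- data.frame(x1,x2,x3,y, stringsAsFactors = TRUE)
df$p_y <- as.numeric(df$y)-1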
# use strata to sample whole dataset
library(sampling)
s1 = strata(df,stratanames = "y", size = c(2500,2500))
s2 = strata(df,stratanames = "y", size = c(2500,2500))
s3 = strata(df,stratanames = "y", size = c(2500,2500))
s4 = strata(df,stratanames = "y", size = c(2500,2500))
s_list <- list(s1$ID_unit, s2$ID_unit, s3$ID_unit, s4$ID_unit)
# model function
rf.formula <- as.formula(paste("y","~",paste("x1","x2",sep="+")))
library(randomForest)
# simple stuff works but takes some time
model.rf <-randomForest(y ~ x1 + x2, df, ntree=100, nodesize = 50)
# build rf with dopar on explicit formula works and is quick
model.rf.dopar <- foreach(subset=s_list, .combine=combine, .packages='randomForest') %dopar%
randomForest(y ~ x1 + x2, df, ntree=100, nodesize = 50, subset=subset)
# build rf with dopar on rf.formula fails
model.rf.s.b2 <- foreach(subset=s_list, .combine=combine, .packages='randomForest') %dopar%
randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
Error:
model.rf.s.b2 <- foreach(subset=s_list, .combine=combine, .packages='randomForest') %dopar%
+ randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
Error in randomForest(rf.formula, df, ntree = 100, nodesize = 50, subset = subset) :
task 1 failed - "invalid subscript type 'closure'"
Any suggestions?
Thx
Answer 0 (score: 2)
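The problem seems to be caused by an indexing operation going wrong inside the model.frame.default function, which is called indirectly by randomForest.formula. I'm not entirely sure what triggers it, because a lot of tricky things happen in model.frame.default, but modifying the formula's environment seems to fix the problem: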
r <- foreach(subset=s_list, .combine='combine', .multicombine=TRUE,
.packages='randomForest') %dopar% {
environment(rf.formula) <- environment()
randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
}
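In particular, this causes subset to be evaluated correctly; otherwise it evaluates to the subset function. I tried renaming the iteration variable, but that didn't help.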
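To make the lookup behaviour concrete, here is a small base-R sketch. It assumes, as a simplification, that the subset argument is resolved by looking first in the data and then in the formula's environment. The helper names lookup_subset and worker and the toy data d are made up for illustration and are not part of randomForest.

```r
# A formula remembers the environment it was created in.
f <- y ~ x1
d <- data.frame(x1 = 1:3, y = c(2, 4, 6))

# Hypothetical helper: resolve the name `subset` the way described above,
# first in the data, then in the formula's environment and its parents.
lookup_subset <- function(formula, data) {
  eval(quote(subset), data, environment(formula))
}

# With the original environment, no variable called `subset` is found
# before the search reaches base R, so the base function subset() is returned.
before <- lookup_subset(f, d)

worker <- function(subset) {
  # Rebinds a local copy of f to this function's frame,
  # where the argument `subset` lives.
  environment(f) <- environment()
  lookup_subset(f, d)
}
after <- worker(c(1L, 3L))

is.function(before)  # the lookup found a function, not row indices
after                # the lookup found the indices passed to worker()
```

Indexing rows with a function instead of a vector of indices is consistent with the "invalid subscript type 'closure'" message in the question.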
Note that I also set .multicombine to TRUE, since the randomForest combine function accepts multiple objects, which can significantly improve performance.
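The effect of .multicombine can be seen with a toy combiner that counts how often it is called. This is only a sketch: count_combine is a hypothetical stand-in for randomForest's combine, and the loop uses the sequential %do% operator so that no parallel backend is needed.

```r
library(foreach)

calls <- 0
# Hypothetical combiner: counts its invocations and sums whatever it receives.
count_combine <- function(...) {
  calls <<- calls + 1
  sum(...)
}

# Default for a user-supplied combiner: results are combined two at a time.
calls <- 0
total_pair <- foreach(i = 1:4, .combine = count_combine) %do% i
calls_pair <- calls

# With .multicombine = TRUE the combiner may receive many results in one call.
calls <- 0
total_multi <- foreach(i = 1:4, .combine = count_combine,
                       .multicombine = TRUE) %do% i
calls_multi <- calls

c(pairwise = calls_pair, multi = calls_multi)
```

Fewer combiner calls matter more when each call is expensive, as merging forests can be.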
Update
The problem can be reproduced with:
fun <- function(subset) {
randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
}
fun(s_list[[1]])
If the variable subset is renamed to s, for example, it also fails, but with a less obscure error message:
> fun <- function(s) {
> randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=s)
> }
> fun(s_list[[1]])
Error in eval(expr, envir, enclos) : object 's' not found
Calls: fun ... eval -> model.frame -> model.frame.default -> eval -> eval
Execution halted
As with the foreach example, resetting the formula's environment seems to fix the problem.
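The same pattern can be tried without randomForest or a parallel backend. The sketch below uses lm from base R, which also builds its model frame through model.frame. The names fit_broken and fit_fixed and the toy data are assumptions made for illustration.

```r
set.seed(1)
d <- data.frame(x = rnorm(20), y = rnorm(20))
f <- y ~ x  # created outside the functions, like rf.formula

# The subset argument `s` is looked up in d and then in environment(f),
# where it does not exist, so this call is expected to fail.
fit_broken <- function(s) {
  lm(f, data = d, subset = s)
}

# Rebinding a local copy of the formula to the function's frame
# makes `s` visible during the lookup.
fit_fixed <- function(s) {
  environment(f) <- environment()
  lm(f, data = d, subset = s)
}

broken <- tryCatch(fit_broken(1:10), error = function(e) conditionMessage(e))
fixed <- fit_fixed(1:10)

broken       # the error message from the failed lookup
nobs(fixed)  # number of rows actually used in the fit
```

The assignment inside fit_fixed only changes a local copy of f, so the formula defined at top level keeps its original environment.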