Cannot identify the node error during parallel processing in R

Date: 2015-09-29 18:26:07

Tags: r parallel-processing netlogo

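I developed a model in NetLogo. Using the RNetLogo package to run NetLogo from R, I ask the model to run each parameter set simultaneously on 8 processors (nodes) and then compile the 8 outputs of those runs into a single csv. This process is then repeated 3 more times in sequence, for a total of 32 (8 * 4) simulations per parameter combination. It also gives me 4 separate csvs, which I later combine into one dataset for analysis. At least that is what I have done in the past, and it worked well.

Now I have modified the code slightly, and I occasionally get an error that shuts down one of the runs (that is, an error on one node shuts down all 8 nodes). So I end up with 8, 16, or 24 simulations instead of 32.

I asked each of the 8 replicates to report how far along it was in the simulation, to determine whether the error occurs somewhere in the middle of a run. However, it appears that most simulations finish, while one or two nodes never start at all.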

 Error in checkForRemoteErrors(val) : one node produced an error: 
 Calls: rep.sim ... clusterApply -> staticClusterApply -> checkForRemoteErrors
 Execution halted

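This suggests that the error is not output from the NetLogo model but comes from something in the execution of the R code (rep.sim) or in write.csv?

Any ideas on how to diagnose this would be great. Below is the R code, which uses the RNetLogo package to control NetLogo and send the model to multiple nodes on a server.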
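For context, `checkForRemoteErrors` belongs to R's `parallel` package: it runs on the master and stops the whole call when any worker function signals an error. The sketch below is a minimal reproduction under stated assumptions, not the asker's code. The `flaky` function is hypothetical and stands in for a call that fails on one node. It shows how to capture the full message instead of letting the script halt.

```r
library(parallel)

# Hypothetical worker: fails for one input, standing in for a call
# that errors on a single node
flaky <- function(x) {
  if (x == 3) stop("setup failed on this node")
  x * 2
}

cl <- makePSOCKcluster(2)

# Capture the condition raised on the master instead of halting the script
msg <- tryCatch(
  {
    parLapply(cl, 1:4, flaky)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

stopCluster(cl)
print(msg)
```

The captured message includes the worker's own error text after the colon, which is the part that is empty in the log above.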

#
#with parallel processing
#
library(parallel)
nl.path <- "/nfs/ncarter-data/netlogo-parallel/NetLogo-5.1.0"
model.path <- "/nfs/ncarter-data/netlogo-parallel/NetLogo_model_cluster_test/June_1_CNP_resource_use.nlogo"
model.directory <- "/nfs/ncarter-data/netlogo-parallel/NetLogo_model_cluster_test"
gui <- FALSE
#
# Create an output dir if the OUTPUT_DIR environment is set, otherwise use current dir
#
outputdir <- Sys.getenv('OUTPUT_DIR')
if (nchar(outputdir) == 0) {
 outputdir <- getwd()
}
setwd(model.directory)
#
# Startup NetLogo
#
prepro <- function(dummy, gui, nl.path, model.path) {
  library(RNetLogo)
  NLStart(nl.path, gui=gui)
  NLLoadModel(model.path)
}
#
# Startup cluster using all available cores
#
processors <- detectCores()
cl <- makePSOCKcluster(processors)
#
# initializing parallel processors
#
invisible(parLapply(cl, 1:processors, prepro, gui=gui, nl.path=nl.path, model.path=model.path))
#
# Function to cancel parallel processing
#
postpro <- function(x) {
  NLQuit()
}
#
#Function to run model simulation
#
sim <- function(per_pixel_prey_remove){
results=list()
NLCommand("set per-pixel-prey-remove", per_pixel_prey_remove, "set entire-site-prey-remove 0.05", "setup", "go")
ret <- NLDoReport(440,"go",c("per-pixel-prey-remove","dead-male-chall","dead-fem-starv","dead-adult-fem","dead-adult-male",
                             "dead-cub-male","dead-cub-fem","dead-juv-male","dead-juv-fem","dead-tran-male", 
                             "dead-tran-fem","num-infanticide","count breeding-males","count breeding-females",
                             "count cub-males","count cub-females","count juvenile-females",
                             "count juvenile-males","count transient-males","count transient-females",
                             "count males","count females","count breeding-females with [count my-offspring > 0]", 
                             "mean [count territory] of breeding-females","mean [count territory] of breeding-males",
                             "mean [count females-in-my-territory] of breeding-males with [count females-in-my-territory > 0]",
                             "mean [count females-in-my-territory] of breeding-males"), 
                  as.data.frame=TRUE);
results[[1]]=ret
return(results)
}
#
# Function to replicate simulation for each parameter value
#
 # Uses the per_pixel_prey_remove argument rather than the global d
 rep.sim <- function(per_pixel_prey_remove, rep) {
   return(
     parLapply(cl, replicate(rep, per_pixel_prey_remove), sim)) 
 }
 d <- seq(0.25,1,0.25)
 per.pixel.prey.remove <- rep.sim(d,processors)
#
# Write Output File
#
write.csv(per.pixel.prey.remove,file.path(outputdir,paste(Sys.getenv("JOB_NAME"),"_output_",format(Sys.time(),"%Y_%m_%d_%H%M%S"),".csv",sep='')))
#
#quit all parallel processing
#
invisible(parLapply(cl, 1:processors, postpro))
stopCluster(cl)  

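1 answer:

Answer 0 (score: 0)

FYI, my NetLogo code had an error in its setup procedure. When a node could not set up the model correctly, it crashed the entire cluster.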
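One way to keep a single failing node from discarding the other results is to catch errors inside the worker function and return them as values. The following is a sketch under stated assumptions rather than the fix used here: `safely` and `toy_sim` are hypothetical names, and `toy_sim` merely stands in for the `sim` function from the question.

```r
library(parallel)

# Wrap a worker function so an error is returned as a value instead of
# being raised (which would stop the whole parLapply call on the master)
safely <- function(f) {
  force(f)
  function(...) {
    tryCatch(
      list(ok = TRUE, value = f(...), error = NA_character_),
      error = function(e) {
        list(ok = FALSE, value = NULL, error = conditionMessage(e))
      }
    )
  }
}

# Hypothetical stand-in for sim(): fails for one parameter value
toy_sim <- function(p) {
  if (p > 0.9) stop("setup failed for p = ", p)
  p * 10
}

cl <- makePSOCKcluster(2)
out <- parLapply(cl, c(0.25, 0.5, 0.75, 1), safely(toy_sim))
stopCluster(cl)

# Identify which tasks failed and inspect their messages
failed <- !vapply(out, function(r) r$ok, logical(1))
print(which(failed))
print(out[[which(failed)[1]]]$error)
```

With this pattern the successful replicates can still be written to csv, and the recorded messages show which task failed and why. Passing `outfile = ""` to `makePSOCKcluster` can also help, since it sends worker output to the master's console rather than discarding it.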