What are the caveats when calling other packages inside doMC's foreach and %dopar%?

Time: 2017-05-03 21:49:07

Tags: r parallel-processing stanford-nlp domc

This code works as expected:

library(dplyr)
data <- list(t1 = "hello world.", t2 = "bye world")

library(doMC)
registerDoMC(3)

res <- foreach(t = data) %dopar% {

    print(sprintf("processing %s", t))

    data.frame(text = t) %>%
    dplyr::count(text)

}

print(res)

However, this code only prints "processing hello world." and "processing bye world" and then hangs (no exception is thrown).

library(dplyr)
coreNLP::initCoreNLP()

data <- list(t1 = "hello world.", t2 = "bye world")

library(doMC)
registerDoMC(3)

res <- foreach(t = data) %dopar% {

    print(sprintf("processing %s", t))

    coreNLP::annotateString(t)$token

}

print(res)

If I change %dopar% to %do%, the code above works as expected.

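I don't understand what causes this behavior. Why does calling a coreNLP function inside %dopar% make R hang, while other packages work fine? Is it related to coreNLP's dependency on Java?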

Here is the output of sessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.0

1 answer:

Answer 0: (score: 1)

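Your first example works fine for me on what looks like a similar setup. My session info after running the example is below; be sure to retry in a fresh R session (R --vanilla). I have four cores (according to parallel::detectCores()).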

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] doMC_1.3.4      iterators_1.0.8 foreach_1.4.3   dplyr_0.5.0    

loaded via a namespace (and not attached):
[1] compiler_3.4.0   magrittr_1.5     R6_2.2.0         assertthat_0.2.0
[5] DBI_0.6-1        tibble_1.3.0     Rcpp_0.12.10     codetools_0.2-15

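Your second example behaves the same way for me: it prints both messages and then hangs until I interrupt it. The output is below. My guess is that the problem involves the forked processes sharing the same underlying Java process/service that coreNLP depends on; I don't really know coreNLP.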

> res <- foreach(t = data) %dopar% {
+ 
+     print(sprintf("processing %s", t))
+ 
+     coreNLP::annotateString(t)$token
+ 
+ }
[1] "processing hello world."
[1] "processing bye world"


^CError in selectChildren(ac, 1) : 
  Java called System.exit(130) requesting R to quit - trying to recover
Error during wrapup: C stack usage  591577121812 is too close to the limit

 *** caught segfault ***
address 0x2, cause 'memory not mapped'
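
If the hang really does come from forking a process that already holds a JVM, one workaround worth trying is to avoid forking altogether and use separate worker processes, each doing its own initialization. The sketch below uses only the base parallel package with a PSOCK cluster. The worker_ready flag and the toupper() call are placeholders made up for illustration; they stand in for coreNLP::initCoreNLP() and coreNLP::annotateString(), and I have not tested this with coreNLP itself.

```r
library(parallel)

data <- list(t1 = "hello world.", t2 = "bye world")

# PSOCK workers are fresh R processes started via Rscript,
# not forks of the current session.
cl <- makeCluster(2, type = "PSOCK")

# Run the setup once on every worker.
# Assumption: for the question's case, coreNLP::initCoreNLP() would go here,
# so that each worker starts its own JVM instead of inheriting one.
invisible(clusterEvalQ(cl, {
    worker_ready <- TRUE   # hypothetical placeholder for real initialization
    NULL
}))

res <- parLapply(cl, data, function(t) {
    list(
        pid   = Sys.getpid(),   # differs from the parent's PID
        # Hypothetical placeholder for coreNLP::annotateString(t)$token
        text  = toupper(t),
        ready = exists("worker_ready", envir = globalenv())
    )
})

stopCluster(cl)
print(res)
```

To keep the foreach loop from the question, the same cluster could be registered with doParallel::registerDoParallel(cl) in place of registerDoMC(3).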