I load the following packages in R:
library(foreach)
library(doParallel)
library(iterators)
I have been parallelizing code for a long time, but recently my code has been stopping INTERMITTENTLY while it runs. The error is:
Error in serialize(data, node$con) : error writing to connection
My educated guess is that the connections I open with the following commands may have expired:
## Register Cluster
##
cores<-8
cl <- makeCluster(cores)
registerDoParallel(cl)
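Looking at the makeCluster man page, I found that by default the connections only expire after 30 days! I could set options(error=recover) to check on the fly whether the connections are still open when the code stops, but I decided to post this general question first.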
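One way to test whether the workers are still reachable, without waiting for a failure, is to send each of them a trivial call and catch any error. The following is only a minimal sketch using the base parallel package; `cluster_alive` is a hypothetical helper name, not part of any package, and the two-worker cluster is just for illustration.

```r
library(parallel)

# Hypothetical helper: TRUE if every worker answers a trivial call,
# FALSE if sending to or receiving from any worker raises an error.
cluster_alive <- function(cl) {
  tryCatch({
    pids <- clusterCall(cl, Sys.getpid)
    length(pids) == length(cl)
  }, error = function(e) FALSE)
}

cl <- makeCluster(2)
alive_before <- cluster_alive(cl)  # workers are up at this point

stopCluster(cl)
alive_after <- cluster_alive(cl)   # connections are closed by now
```

A check like this only shows that the sockets respond at that moment; it does not explain why a worker died.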
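Important:
1) The error really is intermittent; sometimes I rerun the same code and get no error.
2) I run everything on the same multicore machine (Intel, 8 cores), so it is not a communication (network) problem within a cluster.
3) I am a heavy user of CPU and GPU parallelization on my laptop and desktop (64 cores). Unfortunately, this is the first time I have run into this type of error.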
Has anyone else had the same type of error?
As requested, here is my sessionInfo():
> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TTR_0.22-0 xts_0.9-3 doParallel_1.0.1 iterators_1.0.6 foreach_1.4.0 zoo_1.7-9 Revobase_6.2.0 RevoMods_6.2.0
loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.3 grid_2.15.3 lattice_0.20-13 tools_2.15.3
@SteveWeston, below is the error from one of the calls (again, it is intermittent):
starting worker pid=8808 on localhost:10187 at 15:21:52.232
starting worker pid=5492 on localhost:10187 at 15:21:53.624
starting worker pid=8804 on localhost:10187 at 15:21:54.997
starting worker pid=8540 on localhost:10187 at 15:21:56.360
starting worker pid=6308 on localhost:10187 at 15:21:57.721
starting worker pid=8164 on localhost:10187 at 15:21:59.137
starting worker pid=8064 on localhost:10187 at 15:22:00.491
starting worker pid=8528 on localhost:10187 at 15:22:01.855
Error in unserialize(node$con) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Adding more information. I set options(error=recover), and it gave the following:
Error in serialize(data, node$con) : error writing to connection
Enter a frame number, or 0 to exit
1: #51: parallelize(FUN = "ensemble.prism", arg = list(prism = iis.long, instances = oos.instances), vectorize.arg = c("prism", "instances"), cores = cores, .export
2: parallelize.R#58: foreach.bind(idx = i) %dopar% pFUN(idx)
3: e$fun(obj, substitute(ex), parent.frame(), e$data)
4: clusterCall(cl, workerInit, c.expr, exportenv, obj$packages)
5: sendCall(cl[[i]], fun, list(...))
6: postNode(con, "EXEC", list(fun = fun, args = args, return = return, tag = tag))
7: sendData(con, list(type = type, data = value, tag = tag))
8: sendData.SOCKnode(con, list(type = type, data = value, tag = tag))
9: serialize(data, node$con)
Selection: 9
I tried to check whether the connections were still available, and they are:
Browse[1]> showConnections()
description class mode text isopen can read can write
3 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
4 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
5 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
6 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
7 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
8 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
9 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
10 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes" "yes"
Browse[1]>
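Since the connections are open, and the "unknown type 0" error points to the R version (as noted by @SteveWeston), I really cannot understand what is happening here.
EDIT 1:
WORKAROUND FOR MY PROBLEM
The code is fine with respect to the arguments passed to the function, so the answer given by @MichaelFilosi did not help much. Thank you very much for the answer anyway!
I could not find the exact cause of the error, but at least I could work around it.
The trick is to break the arguments of the function call for each parallel thread into smaller chunks.
Magically, the error disappeared.
Let me know whether this works for you too!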
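The chunking workaround can be sketched roughly as follows. This is an illustration under assumptions, not the original code: `work` is a hypothetical stand-in for the real function, `chunk_size` is an arbitrary value, and the base parallel package is used instead of foreach to keep the example self-contained.

```r
library(parallel)

# Hypothetical stand-in for the real per-item work.
work <- function(x) x^2

inputs <- 1:100
chunk_size <- 10  # assumed value; tune it to the size of each item

# Split the inputs into consecutive chunks of at most chunk_size items.
chunks <- split(inputs, ceiling(seq_along(inputs) / chunk_size))

cl <- makeCluster(2)
results <- list()
for (chunk in chunks) {
  # Each call serializes only one small chunk to the workers.
  results <- c(results, parLapply(cl, chunk, work))
}
stopCluster(cl)

result <- unlist(results)
```

Smaller chunks mean less data is serialized over each socket per call, at the cost of more round trips between the master and the workers.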
Answer 0 (score: 12)
This is most likely caused by running out of memory (see my blog post for details). Here is an example of how to trigger this error:
> a <- matrix(1, ncol=10^4*2.1, nrow=10^4)
> cl <- makeCluster(8, type = "FORK")
> parSapply(cl, 1:8, function(x) {
+ b <- a + 1
+ mean(b)
+ })
Error in unserialize(node$con) : error reading from connection
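A back-of-the-envelope estimate shows why the example above is likely to exhaust memory. The sketch below only does arithmetic on the dimensions used in the example; it assumes 8 bytes per double and that each of the 8 workers materializes its own copy of `b`.

```r
# Dimensions taken from the example above.
n_row <- 10^4
n_col <- 10^4 * 2.1
bytes_per_double <- 8

# Size of one copy of the matrix in GiB (ignoring the small object header).
matrix_gib <- n_row * n_col * bytes_per_double / 1024^3

# Assumption: every worker allocates its own result of `a + 1`.
n_workers <- 8
workers_gib <- n_workers * matrix_gib

round(matrix_gib, 2)
round(workers_gib, 2)
```

If the total exceeds the RAM that is free on the machine, the operating system may kill workers, and the master then sees a broken connection.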
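Answer 1 (score: 2)
I ran into a similar error: Error in unserialize(node$con) : error reading from connection
I found that it was caused by a missing argument in a call to a C function through .Call().
Maybe this helps!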
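Answer 2 (score: 2)
I struggled with this problem for quite a while and was able to fix it by moving all required packages into the foreach call through the argument .packages=c("ex1","ex2"). Previously I had just used require("ex1") inside the loop, which seems to have been the root cause of my error.
In general, I would make sure to move everything you can into the foreach arguments to avoid these kinds of errors.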
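A minimal sketch of the .packages approach, assuming the foreach and doParallel packages are installed. The package "tools" stands in for the placeholder names "ex1" and "ex2"; it is used here only because it ships with R and is not attached on workers by default.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

files <- c("a.csv", "b.txt", "c.R")

# .packages attaches "tools" on every worker before the loop body runs,
# so file_ext() resolves without calling require() inside the loop.
exts <- foreach(f = files, .packages = "tools", .combine = c) %dopar% {
  file_ext(f)
}

stopCluster(cl)
```

Declaring packages this way lets foreach load them once per worker during initialization rather than on every iteration.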
Answer 3 (score: 1)
I ran into the same problem, and I suspect it is a memory issue. My code is simple:
library(doParallel)
library(foreach)
cl <- makeCluster(2, outfile='LOG.TXT')
registerDoParallel(cl)
res <- foreach(x=1:10) %dopar% x
I get the following error messages in LOG.TXT:
starting worker pid=13384 on localhost:11776 at 18:25:29.873
starting worker pid=21668 on localhost:11776 at 18:25:30.266
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
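The program works anyway, so I have ignored it for now. Still, seeing these errors in the log file always makes me uneasy.
Answer 4 (score: 0)
In Shiny, I caused this error by writing to a reactiveValues object inside the parallel code.
Answer 5 (score: 0)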
I got the same error when using foreach with the doSNOW backend.
After a timeout, I received the same error as the OP, but no error was returned when running the task without doSNOW.
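Apparently, the task manager can kill processes for a number of reasons, not just running out of memory.
In my particular case, the problem seemed to be core temperature. Reducing the number of CPU cores used for the foreach call made the system run cooler, and the error stopped appearing.
Might be worth a try.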