Question

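I'm launching some Julia workers on separate nodes using a custom ClusterManager, with the standard TCP/IP transport.

I can run remotecall on a worker, but when I ask a remote worker to call println, it fails with a broken pipe exception.

Any idea why this happens?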

julia> remotecall_fetch(90, gethostname)
"gpu-8.local"

julia> remotecall_fetch(90, println, "test")
ERROR: On worker 90:
write: broken pipe (EPIPE)
 in yieldto at ./task.jl:71
 in wait at ./task.jl:371
 in stream_wait at ./stream.jl:60
 in uv_write at stream.jl:962
 in buffer_or_write at stream.jl:972
 in write at stream.jl:1011
 in print at strings/io.jl:46
 in print at strings/io.jl:18
 in println at strings/io.jl:25
 in println at strings/io.jl:28
 in anonymous at multi.jl:923
 in run_work_thunk at multi.jl:661
 [inlined code] from multi.jl:923
 in anonymous at task.jl:63
 in remotecall_fetch at multi.jl:747
 in remotecall_fetch at multi.jl:750

Answer 1

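Answering my own question after a few hours of sweat and tears. I had misunderstood a detail in the documentation: the cluster manager must keep the worker's stdout IO stream open and pass it in the WorkerConfig.io field.

I had noticed this line in the documentation:

The cluster manager captures the stdout of each worker and makes it available to the master process

I originally thought this applied only during the initial handshake, when each worker writes its IP/port to stdout and the master has to capture it to start the session. But I now see that the cluster manager needs to keep piping stdout from the worker to the master the whole time. If that stream goes away after the handshake, a later write to stdout on the worker, such as println, presumably lands on a closed pipe, which would explain the EPIPE error.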
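For illustration, here is a minimal sketch of a manager that does this. It assumes a recent Julia, where the cluster manager API lives in the Distributed standard library (in the 0.4-era session above it was part of Base). PipedManager is a hypothetical name, and it starts workers on the local machine only to keep the example short; a real manager would launch them on the remote nodes. The part that matters is that the worker's stdout pipe is stored in WorkerConfig.io and never closed by the manager.

```julia
using Distributed
import Distributed: launch, manage

# Hypothetical manager, for illustration only: starts `np` local workers.
struct PipedManager <: ClusterManager
    np::Int
end

function launch(manager::PipedManager, params::Dict, launched::Array, c::Condition)
    exename  = params[:exename]
    exeflags = params[:exeflags]
    for _ in 1:manager.np
        # Pass the cluster cookie on the command line so the worker
        # does not need to read it from stdin.
        cmd = `$exename $exeflags --worker=$(Distributed.cluster_cookie())`

        # Open the process for reading; proc.out is the worker's stdout pipe.
        proc = open(detach(cmd), "r")

        wconfig = WorkerConfig()
        wconfig.process = proc
        # Hand the stdout stream to the master and leave it open. The master
        # reads the worker's host:port line from it during the handshake and
        # keeps reading from it afterwards to relay the worker's output.
        wconfig.io = proc.out

        push!(launched, wconfig)
        notify(c)
    end
end

# No per-worker bookkeeping is needed for this sketch.
function manage(manager::PipedManager, id::Integer, config::WorkerConfig, op::Symbol)
    nothing
end
```

With this in place, addprocs(PipedManager(2)) should start two workers, and remotecall_fetch(id, println, "test") on one of them should print on the master instead of failing, since the worker's stdout still has a reader.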
