Distributing graphs across cluster nodes

Posted: 2017-09-03 12:29:37

Tags: distributed dask

I've been making good progress with Dask.delayed. As a group, we've decided to spend more time processing our graphs with Dask.

I have a question about distribution. I'm seeing the following behaviour on a cluster. I start, for example, 8 workers across 8 nodes, with 4 threads per node, and then use client.compute on 8 graphs that create simulated data for subsequent processing. I expected one of the 8 data sets to be generated on each node. What actually happens, not unreasonably, is that the eight functions run on the first two nodes, and subsequent computations also run on those first two nodes, so I see a lack of scaling. Over time, the other nodes also disappear from the diagnostics workers page. Is this expected?
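To make the setup concrete, here is a simplified sketch along these lines (create_simulated_data is a stand-in name rather than my actual function; the scheduler address is the one from the log below):

import dask
from dask.distributed import Client

# Connect to the dask-ssh scheduler (address taken from the log below).
client = Client("tcp://10.143.6.70:8786")

# Stand-in for the real data-creation function.
@dask.delayed
def create_simulated_data(i):
    return {"dataset": i}

# Eight independent graphs, ideally one per node.
graph_list = [create_simulated_data(i) for i in range(8)]

# In practice these all seem to end up on the first couple of workers.
results = client.compute(graph_list, sync=True)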

So I wanted to distribute the data-creation functions across the nodes myself. Now, when I want to compute a list of graphs, I do:

if nodes is not None:
    print("Computing graph_list on the following nodes: %s" % nodes)
    return client.compute(graph_list, sync=True, workers=nodes, **kwargs)
else:
    return client.compute(graph_list, sync=True, **kwargs)
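Here nodes is a list of worker addresses. A sketch of how it could be built from the scheduler's view of registered workers (the helper name get_worker_addresses is mine, purely for illustration):

# Illustrative helper, not from my actual code: workers= accepts an
# iterable of worker addresses, which can be read from the scheduler.
def get_worker_addresses(client):
    # scheduler_info() returns a dict whose 'workers' entry maps
    # address -> metadata for every registered worker
    return sorted(client.scheduler_info()["workers"])

nodes = get_worker_addresses(client)
# e.g. ['tcp://10.143.6.70:43208', 'tcp://10.143.6.71:46656', ...]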

This seems to be set up correctly: the diagnostics progress bar shows my data-creation functions, but they never start running. If I omit workers=nodes, the computation proceeds as expected. The problem occurs both on the cluster and on my desktop.

Some more information: looking at the scheduler log, I do see communication failures.

more dask-ssh_2017-09-04_09\:52\:09/dask_scheduler_sand-6-70\:8786.log
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO -   Scheduler at:    tcp://10.143.6.70:8786
distributed.scheduler - INFO -       bokeh at:              0.0.0.0:8787
distributed.scheduler - INFO -        http at:              0.0.0.0:9786
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-ny4ev7qh
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://10.143.6.73:36810
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.6.73:36810
distributed.scheduler - INFO - Register tcp://10.143.6.71:46656
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.6.71:46656
distributed.scheduler - INFO - Register tcp://10.143.7.66:42162
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.7.66:42162
distributed.scheduler - INFO - Register tcp://10.143.7.65:35114
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.7.65:35114
distributed.scheduler - INFO - Register tcp://10.143.6.70:43208
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.6.70:43208
distributed.scheduler - INFO - Register tcp://10.143.7.67:45228
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.7.67:45228
distributed.scheduler - INFO - Register tcp://10.143.6.72:36100
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.6.72:36100
distributed.scheduler - INFO - Register tcp://10.143.7.68:41915
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.143.7.68:41915
distributed.scheduler - INFO - Receive client connection: 5d1dab2a-914e-11e7-8bd1-180373ff6d8b
distributed.scheduler - INFO - Worker 'tcp://10.143.6.71:46656' failed from closed comm: Stream is closed
distributed.scheduler - INFO - Remove worker tcp://10.143.6.71:46656
distributed.scheduler - INFO - Removed worker tcp://10.143.6.71:46656
distributed.scheduler - INFO - Worker 'tcp://10.143.6.73:36810' failed from closed comm: Stream is closed
distributed.scheduler - INFO - Remove worker tcp://10.143.6.73:36810
distributed.scheduler - INFO - Removed worker tcp://10.143.6.73:36810
distributed.scheduler - INFO - Worker 'tcp://10.143.6.72:36100' failed from closed comm: Stream is closed
distributed.scheduler - INFO - Remove worker tcp://10.143.6.72:36100
distributed.scheduler - INFO - Removed worker tcp://10.143.6.72:36100
distributed.scheduler - INFO - Worker 'tcp://10.143.7.67:45228' failed from closed comm: Stream is closed
distributed.scheduler - INFO - Remove worker tcp://10.143.7.67:45228
distributed.scheduler - INFO - Removed worker tcp://10.143.7.67:45228
(arlenv) [hpccorn1@login-sand8 performance]$

Does this suggest any possible causes?

Thanks, Tim

1 Answer:

Answer 0 (score: 0)

How Dask chooses to assign tasks to workers is complex and takes many issues into account, such as load balancing, data transfer, resource constraints, and so on. Without a concrete and simple example, it can be hard to work out what is going wrong.

One thing you can try is to submit all of your computations at once. This lets the scheduler make somewhat more intelligent decisions, because it sees everything together rather than one piece at a time.

So, you could try replacing code like this:

futures = [client.compute(delayed_value) for delayed_value in L]
wait(futures)

with code like this:

futures = client.compute(L)
wait(futures)
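Put together, a self-contained version of that second pattern might look something like this (make_value is a placeholder for your delayed function, and the scheduler address is illustrative):

import dask
from dask.distributed import Client, wait

client = Client("tcp://10.143.6.70:8786")  # placeholder address

@dask.delayed
def make_value(i):
    return i * 2  # stand-in for the real work

L = [make_value(i) for i in range(8)]

# One compute call over the whole list: the scheduler sees all the tasks
# at once and can spread them across workers more evenly.
futures = client.compute(L)
wait(futures)
results = client.gather(futures)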

But honestly, I'd give this only about a 30% chance of solving your problem. It's hard to know what's going on without digging into your situation much more deeply. It would be best if you could provide a very simple, reproducible code example.