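I have launched a dask-mpi cluster through slurm (using dask.distributed) on multiple cores of a slurm-managed cluster. All of the processes appear to have started OK (the stdout in the slurm log file looks normal), but when I try to connect a client from Python using client = Client(scheduler_file='/path/to/my/scheduler.json'), I get a timeout error, as follows: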
distributed.utils - ERROR - Timed out trying to connect to 'tcp://141.142.181.102:8786' after 5 s: connect() didn't finish in time
Traceback (most recent call last):
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/distributed/comm/core.py", line 185, in connect
    quiet_exceptions=EnvironmentError)
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.gen.TimeoutError: Timeout
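These are the contents of scheduler.json after launching. I don't know whether it is normal for no workers to be listed here, or whether this indicates a problem with the setup: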
{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {
    "bokeh": 8787
  },
  "workers": {}
}
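I have run into the same problem on two different slurm-managed clusters. Do I need to specify something port-related? If so, how do I determine which ports need to be used?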
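In case it helps narrow things down, here is a minimal sketch of how the address in scheduler.json could be checked for plain TCP reachability from the machine where the client runs, using only the Python standard library. The helper names scheduler_endpoint and can_connect are hypothetical (they are not part of dask), and a successful connection only shows that the port is open, not that the scheduler is healthy.

```python
import json
import socket
from urllib.parse import urlparse


def scheduler_endpoint(scheduler_file):
    """Read a scheduler file and return the (host, port) from its "address" field."""
    with open(scheduler_file) as f:
        info = json.load(f)
    # The address looks like "tcp://141.142.181.102:8786".
    parsed = urlparse(info["address"])
    return parsed.hostname, parsed.port


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder path, same as in the question.
    host, port = scheduler_endpoint("/path/to/my/scheduler.json")
    print(host, port, "reachable" if can_connect(host, port) else "NOT reachable")
```

If this reports the port as not reachable from the login node but reachable from a compute node, that would suggest a firewall or network interface issue rather than a problem with the dask setup itself.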