Question

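I have launched a dask-mpi cluster via slurm (using dask.distributed) across a number of cores on a slurm-managed cluster. All of the processes seem to have started OK (the stdout in the slurm log file looks normal), but when I try to connect a client from python using client = Client(scheduler_file='/path/to/my/scheduler.json'), I get a timeout error like the following: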

distributed.utils - ERROR - Timed out trying to connect to 'tcp://141.142.181.102:8786' after 5 s: connect() didn't finish in time
Traceback (most recent call last):
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/distributed/comm/core.py", line 185, in connect
    quiet_exceptions=EnvironmentError)
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.gen.TimeoutError: Timeout

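These are the contents of scheduler.json after launch. I don't know whether it is normal for no workers to be listed here, or whether this indicates a problem with the setup: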

{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {
    "bokeh": 8787
  },
  "workers": {}
}

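I'm running into the same problem on two different slurm-managed clusters. It looks like I might need to specify something port-related? If so, how do I determine which ports I need to use?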
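One way to narrow this down might be to check whether the scheduler's port is reachable at all from the machine where the client runs, independent of dask. The following is only a minimal sketch, assuming the scheduler file has the layout shown above; the helper names `parse_scheduler_address`, `can_connect`, and `check_scheduler_file` are hypothetical and not part of dask.distributed.

```python
import json
import socket
from urllib.parse import urlparse


def parse_scheduler_address(scheduler_info):
    """Return (host, port) from the 'address' field of a scheduler.json dict."""
    # The address looks like tcp://141.142.181.102:8786
    parsed = urlparse(scheduler_info["address"])
    return parsed.hostname, parsed.port


def can_connect(host, port, timeout=5.0):
    """Attempt a plain TCP connection; True if it succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_scheduler_file(path, timeout=5.0):
    """Read a scheduler file and report whether its address accepts TCP connections."""
    with open(path) as f:
        info = json.load(f)
    host, port = parse_scheduler_address(info)
    return host, port, can_connect(host, port, timeout=timeout)
```

Calling `check_scheduler_file('/path/to/my/scheduler.json')` from the same node where the client is started would show whether a raw TCP connection to that address succeeds. If it does not, the timeout is more likely a network or firewall matter between nodes than something in the dask setup itself.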

Timeout error when trying to connect a dask.distributed client on a slurm-managed cluster

0 Answers: