mpiexec does not run an mpi4py script when two hosts are used in an MPI cluster set up over a LAN

Asked: 2016-02-22 16:05:55

Tags: python ssh mpi hosting

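I have a desktop PC that acts as my server, primesystem, and a laptop that acts as my client, zerosystem, connected to it. They serve as my ssh-server and ssh-client respectively, and are connected by an Ethernet (not crossover) cable.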

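I followed the instructions in both of these tutorials: "Running an MPI Cluster within a LAN" and "Setting Up an MPICH2 Cluster in Ubuntu". The only difference is that I want to use a Python implementation of MPI, so I used mpi4py to test whether both PCs can use MPI.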

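I set up a directory, /cloud, on primesystem. It is shared over my network and mounted on zerosystem as the first tutorial instructs (so I can also work on both systems without logging in via ssh).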

On the server, primesystem, the sample helloworld script runs fine:

one@primesystem:/cloud$ mpirun -np 5 -hosts primesystem python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on primesystem.
Hello, World! I am process 1 of 5 on primesystem.
Hello, World! I am process 2 of 5 on primesystem.
Hello, World! I am process 3 of 5 on primesystem.
Hello, World! I am process 4 of 5 on primesystem.

The same is true if I run it on the host zerosystem (although there is a noticeable delay in execution, since the processes run on zerosystem's CPU remotely):

one@primesystem:/cloud$ mpirun -np 5 -hosts zerosystem python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on zerosystem.
Hello, World! I am process 1 of 5 on zerosystem.
Hello, World! I am process 2 of 5 on zerosystem.
Hello, World! I am process 3 of 5 on zerosystem.
Hello, World! I am process 4 of 5 on zerosystem.
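
For reference, the greeting lines in the two runs above come from the helloworld test bundled with mpi4py. A minimal sketch of a roughly equivalent standalone script is shown here, assuming mpi4py is installed on both machines. The names `format_greeting` and `main` are my own and not part of mpi4py.

```python
def format_greeting(rank, size, name):
    """Build a greeting line in the same shape as the output above."""
    return "Hello, World! I am process %d of %d on %s." % (rank, size, name)


def main():
    # Import inside main so the formatting helper works without MPI installed.
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed on this host")
        return

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()            # index of this process
    size = comm.Get_size()            # total number of processes
    name = MPI.Get_processor_name()   # usually the host name
    print(format_greeting(rank, size, name))


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as hello.py in /cloud, it could be launched the same way as the bundled test, for example with `mpirun -np 5 -hosts primesystem python hello.py`.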

But if I use both hosts, it seems to stop responding after the first line:

one@primesystem:/cloud$ mpirun -np 5 -hosts primesystem,zerosystem python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on primesystem.

(If I swap the order of the hosts so that zerosystem comes first, no Hello World line is printed at all.)

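I also tried listing the hosts, along with the number of processes each one should spawn, in a .mpi-config file and passing it with the -f argument instead of -hosts: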

zerosystem:4
primesystem:2
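
As a sanity check on that file, the per-host counts should add up to the value passed to -np. A small sketch that parses the `host:count` lines and sums them follows. The helper name `parse_hostfile` is hypothetical, and treating a bare host name as one process is an assumption on my part.

```python
def parse_hostfile(text):
    """Return a list of (host, nprocs) pairs from 'host:nprocs' lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, sep, count = line.partition(":")
        # Assumption: a host listed without a count gets one process.
        entries.append((host.strip(), int(count) if sep else 1))
    return entries


HOSTFILE = """\
zerosystem:4
primesystem:2
"""

entries = parse_hostfile(HOSTFILE)
total = sum(nprocs for _, nprocs in entries)
print(entries, total)
```

Here the total is 6 (4 on zerosystem plus 2 on primesystem), which matches the `-np 6` used in the command below, so the counts themselves do not look like the problem.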

But I still get the same behavior, and after anywhere from several seconds to a few minutes, this error output appears:

one@primesystem:/cloud$ mpirun -np 6 -f .mpi-config python -m mpi4py helloworld
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 23329 RUNNING AT primesystem
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@zerosystem] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@zerosystem] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@zerosystem] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@primesystem] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@primesystem] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@primesystem] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@primesystem] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
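
A note on `EXIT CODE: 139` in the output above: by the usual shell convention, an exit code above 128 means the process was killed by signal number (code - 128), which would make this signal 11. Assuming the launcher reports the code by that convention, a short sketch to decode it follows (the helper name `describe_exit_code` is hypothetical):

```python
import signal


def describe_exit_code(code):
    """Map a shell-style exit code to a signal name when it is above 128."""
    if code > 128:
        signum = code - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a known signal on this platform.
            return "unknown signal %d" % signum
    return "exit status %d" % code


print(describe_exit_code(139))
```

On Linux, signal 11 is SIGSEGV, so under this reading one of the Python processes on primesystem crashed with a segmentation fault rather than exiting with an ordinary error.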

Why does this happen? Any ideas?

0 answers:

No answers yet.