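I have a desktop PC acting as my server, primesystem, and a laptop acting as my client, zerosystem, which connects to it. They serve as my ssh-server and ssh-client respectively, and are connected with an Ethernet (not crossover) cable.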
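I followed the instructions in these two tutorials: Running an MPI Cluster within a LAN and Setting Up an MPICH2 Cluster in Ubuntu. The only difference is that I wanted to use a Python implementation of MPI, so I used mpi4py to test whether both PCs can use MPI.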
I set up a directory /cloud on primesystem that is shared over my network and, as the first tutorial instructs, mounted on zerosystem (so I can work on both systems without logging in via ssh).
On the server, primesystem, the sample helloworld script runs fine:
one@primesystem:/cloud$ mpirun -np 5 -hosts primesystem python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on primesystem.
Hello, World! I am process 1 of 5 on primesystem.
Hello, World! I am process 2 of 5 on primesystem.
Hello, World! I am process 3 of 5 on primesystem.
Hello, World! I am process 4 of 5 on primesystem.
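The same goes if I run it through the host zerosystem (though note there is a noticeable delay in execution, since it uses zerosystem's CPU, which is external to primesystem):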
one@primesystem:/cloud$ mpirun -np 5 -hosts zerosystem python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on zerosystem.
Hello, World! I am process 1 of 5 on zerosystem.
Hello, World! I am process 2 of 5 on zerosystem.
Hello, World! I am process 3 of 5 on zerosystem.
Hello, World! I am process 4 of 5 on zerosystem.
But if I use both hosts, it does not seem to respond; only one process ever replies:
one@primesystem:/cloud$ mpirun -np 5 -hosts primesystem,zerosystem python -m mpi4py helloworld
Hello, World! I am process 0 of 5 on primesystem.
(If I swap the order of the hosts so that zerosystem comes first, no Hello World response is shown at all.)
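I also tried listing the hosts, and the number of processes to spawn on each, in a .mpi-config file and passing it with the -f argument instead of -hosts: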
zerosystem:4
primesystem:2
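As a sanity check on that file, here is a minimal sketch of how I read it, assuming the `host:count` hostfile syntax. The helper `parse_hostfile` is a name I made up for illustration and is not part of MPICH or mpi4py, and treating a host with no count as one process is my assumption. It only confirms that the per-host counts add up to the 6 processes I request with -np.

```python
def parse_hostfile(text):
    """Parse 'host:count' lines into a list of (host, count) pairs.

    Hypothetical helper for illustration only; a host listed without
    a count is assumed to mean one process.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, _, count = line.partition(":")
        entries.append((host.strip(), int(count) if count else 1))
    return entries


# Contents of my .mpi-config, copied from above.
HOSTFILE = """\
zerosystem:4
primesystem:2
"""

entries = parse_hostfile(HOSTFILE)
total = sum(count for _, count in entries)
print(entries, total)
```

So the file asks for 4 processes on zerosystem and 2 on primesystem, which matches the total I pass to mpirun.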
But I still get the same behavior, and after a few seconds or minutes this error output appears:
one@primesystem:/cloud$ mpirun -np 6 -f .mpi-config python -m mpi4py helloworld
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 23329 RUNNING AT primesystem
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@zerosystem] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@zerosystem] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@zerosystem] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@primesystem] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@primesystem] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@primesystem] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@primesystem] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
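One thing I can read from this output: exit codes above 128 conventionally mean the process was killed by a signal, with the signal number being the code minus 128. A small sketch to decode it, using only Python's standard `signal` module (the function name `describe_exit_code` is my own, chosen for illustration):

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a 128+N style exit code, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None  # no signal with that number on this platform
    return None


print(describe_exit_code(139))
```

On Linux, 139 - 128 = 11 is SIGSEGV, so the process with PID 23329 on primesystem crashed with a segmentation fault rather than exiting on its own.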
Why is this happening? Any ideas?