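I am trying to install Open MPI and MPICH2 on a multi-node cluster, but with both of them I have trouble running across multiple machines. With MPICH2 I can run on a specific host from the head node, but if I try to launch something from a compute node onto another node, I get: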
HYDU_sock_connect (utils/sock/sock.c:172): unable to connect from "destination_node" to "parent_node" (No route to host)
[proxy:0:0@destination_node] main (pm/pmiserv/pmip.c:189): unable to connect to server parent_node at port 56411 (check for firewalls!)
If I try to use SGE to set up the job, I get similar errors.
On the other hand, if I use Open MPI to run jobs, I cannot run on any remote machine, even from the head node. I get:
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
The machines are connected to each other: I can ping and use passwordless ssh from any machine to any other, and MPI_LIB and PATH are set correctly on all machines.
Answer 0 (score: 0)
This usually happens because you did not set up a hostfile or pass a list of hosts on the command line.
For MPICH, you do this by passing the -host flag on the command line, followed by a comma-separated list of hosts (host1,host2,host3, etc.):
mpiexec -host host1,host2,host3 -n 3 <executable>
You can also put them in a file, one host per line:
host1
host2
host3
Then pass that file on the command line like this:
mpiexec -f <hostfile> -n 3 <executable>
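If you generate the host list from a script, the hostfile and the matching command can be produced together. The following is only a minimal Python sketch: the function names, the temporary directory, and the executable name `./my_program` are placeholders made up for illustration, and it only builds the `mpiexec -f` argument list shown above without launching anything.

```python
import tempfile
from pathlib import Path


def write_hostfile(hosts, path):
    """Write one hostname per line (the layout shown above) and return the path."""
    path = Path(path)
    path.write_text("".join(f"{host}\n" for host in hosts))
    return path


def mpich_hostfile_command(hostfile, nprocs, executable):
    """Build the argument list for `mpiexec -f <hostfile> -n <nprocs> <executable>`."""
    return ["mpiexec", "-f", str(hostfile), "-n", str(nprocs), executable]


# Placeholder hostnames and executable name, for illustration only.
hosts = ["host1", "host2", "host3"]
hostfile = write_hostfile(hosts, Path(tempfile.mkdtemp()) / "hostfile")
command = mpich_hostfile_command(hostfile, len(hosts), "./my_program")

# Print the command instead of running it, so the sketch works without MPI installed.
print(" ".join(command))
```

To actually launch the job you could hand `command` to `subprocess.run`, assuming `mpiexec` is on the PATH of the machine you start from.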
Similarly, with Open MPI, you can use:
mpiexec --host host1,host2,host3 -n 3 <executable>
and
mpiexec --hostfile hostfile -n 3 <executable>
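The two implementations differ only in how the flags are spelled, which is easy to mix up when both are installed on the same cluster. As a rough sketch, the Python helper below maps the flags from the commands above onto one function. The labels `"mpich"` and `"openmpi"`, the function name, and `./my_program` are assumptions made for this example, not part of either tool.

```python
# Flag spellings are taken from the commands above; the dictionary keys
# ("mpich", "openmpi", "hosts", "hostfile") are labels invented for this sketch.
FLAGS = {
    "mpich": {"hosts": "-host", "hostfile": "-f"},
    "openmpi": {"hosts": "--host", "hostfile": "--hostfile"},
}


def build_mpiexec(flavor, executable, nprocs, hosts=None, hostfile=None):
    """Return an mpiexec argument list using either a host list or a hostfile."""
    if (hosts is None) == (hostfile is None):
        raise ValueError("pass exactly one of hosts or hostfile")
    flags = FLAGS[flavor]
    if hosts is not None:
        # Host list form: a single comma-separated argument.
        target = [flags["hosts"], ",".join(hosts)]
    else:
        # Hostfile form: path to a file with one host per line.
        target = [flags["hostfile"], str(hostfile)]
    return ["mpiexec", *target, "-n", str(nprocs), executable]


# Placeholder hostnames and executable name, for illustration only.
print(" ".join(build_mpiexec("mpich", "./my_program", 3, hosts=["host1", "host2", "host3"])))
print(" ".join(build_mpiexec("openmpi", "./my_program", 3, hostfile="hostfile")))
```

Note that this only assembles the command line. If `mpiexec` on your PATH belongs to the other implementation, the flags will not match, so it is worth checking which one is picked up on each node.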
You can find more information at the following links: