Open MPI / MPICH2 does not run on multiple nodes

Time: 2014-11-21 01:39:41

Tags: networking mpi openmpi sungridengine mpich

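I am trying to install and use Open MPI and MPICH2 on a multi-node cluster, but in both cases I cannot run across multiple machines. With MPICH2 I can run on a specific host from the head node, but if I try to run something from a compute node on another node, I get: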

HYDU_sock_connect (utils/sock/sock.c:172): unable to connect from "destination_node" to "parent_node" (No route to host)
[proxy:0:0@destination_node] main (pm/pmiserv/pmip.c:189): unable to connect to server parent_node at port 56411 (check for firewalls!)

If I try to set up the job through SGE, I get a similar error.

With Open MPI, on the other hand, I cannot run on any remote machine at all, even from the head node. I get:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

The machines are connected to each other: I can ping and use passwordless ssh from any machine to any other, and MPI_LIB and PATH are set correctly on all of them.

1 Answer:

Answer 0 (score: 0)

Usually this happens because you have not set up a host file or passed a list of hosts on the command line.

For MPICH, you can do this by passing the -host flag on the command line, followed by a comma-separated list of hosts (host1,host2,host3, etc.):

mpiexec -host host1,host2,host3 -n 3 <executable>

You can also put them in a file, one host per line:

host1
host2
host3

Then pass that file on the command line like this:

mpiexec -f <hostfile> -n 3 <executable>
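
As a rough sketch of the host file approach, the script below writes the three-line file and prints the matching MPICH command instead of running it. The host names host1, host2, host3 and the temporary file path are placeholders, not values from the question; substitute names that your nodes can actually resolve.

```shell
#!/bin/sh
# Placeholder host names (assumption): replace with your real node names.
hostfile="${TMPDIR:-/tmp}/mpi_hosts.$$"
printf '%s\n' host1 host2 host3 > "$hostfile"

# Count non-empty lines so that one rank is requested per listed host.
nprocs=$(grep -c . "$hostfile")

# Dry run: print the command rather than launching anything.
echo "mpiexec -f $hostfile -n $nprocs <executable>"
```

Using `hostname` as the executable for a first run is a convenient smoke test, since each rank prints the node it started on.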

Similarly, with Open MPI you can use either of the following:

mpiexec --host host1,host2,host3 -n 3 <executable>

mpiexec --hostfile hostfile -n 3 <executable>
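
An Open MPI host file can also state how many processes each node should take by adding `slots=N` after the host name. The sketch below assumes two slots per node, which is an illustrative number and not something taken from the question. It sums the slots to get a process count and prints the command rather than running it. Host names and the file path are again placeholders.

```shell
#!/bin/sh
# Placeholder host names and slot counts (assumptions for illustration).
hostfile="${TMPDIR:-/tmp}/ompi_hosts.$$"
cat > "$hostfile" <<'EOF'
host1 slots=2
host2 slots=2
host3 slots=2
EOF

# Add up every slots=N value to get the total number of ranks.
total=$(awk '{
    for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); n += $i }
} END { print n + 0 }' "$hostfile")

# Dry run: print the command rather than launching anything.
echo "mpiexec --hostfile $hostfile -n $total <executable>"
```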

You can find more information at the following links: