Setting up OpenMPI-1.10.2 to run jobs on multiple nodes

Time: 2016-04-07 10:43:43

Tags: ubuntu openmpi

First of all, my setup so far:
I'm working on a freshly installed Ubuntu GNOME 15.10 on all PCs. My network consists of 4 PCs with static IPs (198.168.0.1 - 198.168.0.4), with 198.168.0.4 as the master, where I have installed Open MPI 1.10.2 in /opt/openmpi-1.10.2/.
I share this and another folder (/home/cgv_wand/openmpi-1.10.2/) via NFS with the other nodes. In the second folder I store my Open MPI application (just a sample app for testing).

My /etc/exports file for NFS looks like this:

/home/cgv_wand/openmpi-1.10.2   192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)  
/opt/openmpi-1.10.2             192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)

I also defined the PATH and LD_LIBRARY_PATH variables in the .bashrc files of the 4 PCs:

export PATH=:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/openmpi-1.10.2/bin
export LD_LIBRARY_PATH=:/opt/openmpi-1.10.2/lib/

Additionally, I have set up an SSH server on each of the nodes (198.168.0.1 - 198.168.0.3) and shared the public key of my master node with them for password-less login.

Now to my problem:
If I run an MPI job via

mpirun -np 1 hello_c

everything works fine. But if I try to run this job on, for example, 2 nodes, it doesn't work (mary is the master; mila-1, mila-2 and mila-3 are the other nodes):

mpirun -np 2 --host mary,mila-2 hello_c

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

If I try to run the job just on mila-2 (198.168.0.2) I get the following error:

mpirun -np 1 --host mila-2 hello_c
hello_c: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54161,1],0]
  Exit code:    127
--------------------------------------------------------------------------

I have already read the Open MPI FAQ and a lot of topics here, but I have no idea what may cause these problems. Maybe someone here can help me.

1 answer:

Answer 0 (score: 0)

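The error almost certainly comes from your ~/.bashrc file. Where in the file are your environment variables? If they are at the end, that part of the file is never executed when mpirun starts processes on the remote nodes over ssh, because the shell there runs in non-interactive mode. Note the if statement near the top of ~/.bashrc that exits early for non-interactive shells: you need to put your environment variables at the top of the file, before that if.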
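
The effect can be reproduced locally without ssh. The following is only a minimal sketch: it writes a throwaway file that imitates a ~/.bashrc with an early-exit guard (modeled on the one in Ubuntu's default ~/.bashrc, which is an assumption about your file) and sources it from a non-interactive bash. The DEMO_* variable names are made up for the demonstration.

```shell
# Build a throwaway stand-in for ~/.bashrc (hypothetical contents).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# Exported before the guard: visible to non-interactive shells.
export DEMO_BEFORE_GUARD=set

# Guard: stop reading the file unless the shell is interactive.
case $- in
    *i*) ;;
      *) return;;
esac

# Exported after the guard: never reached by non-interactive shells.
export DEMO_AFTER_GUARD=set
EOF

# "bash -c" starts a non-interactive shell, similar to what ssh does
# when it is given a command to run on a remote node.
result=$(bash -c '. "$1"; echo "before=${DEMO_BEFORE_GUARD:-unset} after=${DEMO_AFTER_GUARD:-unset}"' _ "$tmp")
echo "$result"

rm -f "$tmp"
```

Only the variable exported before the guard survives. Applied to the question, this suggests moving the PATH and LD_LIBRARY_PATH export lines above the guard in ~/.bashrc on every node.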