First of all my setup till now:
I'm running a freshly installed Ubuntu GNOME 15.10 on all PCs.
My network consists of 4 PCs with static IPs (198.168.0.1 - 198.168.0.4), with 198.168.0.4 as the master, where I have installed Open MPI 1.10.2 in /opt/openmpi-1.10.2/.
I share this and another folder (/home/cgv_wand/openmpi-1.10.2/) via NFS with the other nodes. In the second folder I store my Open MPI application (just a sample app for testing).
My /etc/exports file for NFS looks like this:
/home/cgv_wand/openmpi-1.10.2 192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)
/opt/openmpi-1.10.2 192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)
I also defined the PATH and LD_LIBRARY_PATH variables in the .bashrc files of all 4 PCs:
export PATH=:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/openmpi-1.10.2/bin
export LD_LIBRARY_PATH=:/opt/openmpi-1.10.2/lib/
Additionally, I have set up an SSH server on each of the nodes (198.168.0.1 - 198.168.0.3) and shared the master node's public key with them for passwordless login.
Now to my problem:
If I run an MPI job via
mpirun -np 1 hello_c
everything works fine. But if I try to run this job on, for example, 2 nodes, it doesn't work (mary is the master; mila-1, mila-2 and mila-3 are the other nodes):
mpirun -np 2 --host mary,mila-2 hello_c
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
If I try to run the job on mila-2 (198.168.0.2) only, I get the following error:
mpirun -np 1 --host mila-2 hello_c
hello_c: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[54161,1],0]
Exit code: 127
--------------------------------------------------------------------------
I have already read the Open MPI FAQ and a lot of topics here, but I have no idea what may be causing these problems. Maybe someone here can help me.
Answer 0 (score: 0)
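The error definitely comes from your ~/.bashrc file. Where in the file are your environment variables set? If they are at the end, that part of the .bashrc is never executed when MPI is started over ssh on the remote nodes, because the shell there runs in non-interactive mode. Note the check near the top of ~/.bashrc that exits early for non-interactive shells: you need to put your environment variables at the top of the file, before that early exit.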
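The ordering described above can be sketched roughly as follows. This is only an illustration, not your actual file: it assumes the early exit is the case statement on $- that Ubuntu's stock ~/.bashrc typically uses (older versions test $PS1 instead). The body is wrapped in a function (load_bashrc_sketch, a made-up name) so that it can be run as a plain script, and REACHED_AFTER_GUARD is a hypothetical marker used only to show which lines are skipped. In a real ~/.bashrc the same lines would sit at top level.

```shell
# Sketch of the ~/.bashrc ordering, wrapped in a function so it can be run
# as a plain script. load_bashrc_sketch is a hypothetical name.
load_bashrc_sketch() {
    # 1) Exports first: reached by interactive AND non-interactive shells.
    export PATH="$PATH:/opt/openmpi-1.10.2/bin"
    export LD_LIBRARY_PATH="/opt/openmpi-1.10.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

    # 2) The guard (assumed form): shells without the "i" flag, such as the
    #    one ssh starts on a remote node for mpirun, stop reading here.
    #    The stock file uses a bare "return" at this point.
    case $- in
        *i*) ;;
          *) return 0 ;;
    esac

    # 3) Anything below the guard is skipped for non-interactive shells.
    #    REACHED_AFTER_GUARD is a hypothetical marker for this demo only.
    export REACHED_AFTER_GUARD=1
}

load_bashrc_sketch
echo "PATH=$PATH"
echo "after guard: ${REACHED_AFTER_GUARD:-not reached}"
```

Run non-interactively, the script should show the Open MPI bin directory in PATH while the marker below the guard stays unset. That marker is in the same position as exports placed at the end of ~/.bashrc, which is why orted is not found on the remote nodes.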