Running MPI on a LAN cluster with different usernames

Time: 2018-11-10 05:26:34

Tags: mpi host

I have two computers with different usernames: say user1@master and user2@slave. I want to run an MPI job across both machines, but so far I have not succeeded. I have already set up passwordless ssh between the two machines. Both machines have the same OpenMPI version, and both have PATH and LD_LIBRARY_PATH set accordingly.
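For reference, passwordless ssh between accounts with different usernames is typically set up along these lines (a minimal sketch; the key type is an assumption, not taken from the question):

$ ssh-keygen -t rsa                # on user1@master: generate a key pair
$ ssh-copy-id user2@slave          # install user1's public key for user2 on slave
$ ssh user2@slave hostname         # should now succeed without a password prompt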

The OpenMPI installation path on each computer is /home/$USER/.openmpi, and the program I want to run is located in ~/folder.
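Settings like these normally live in each user's shell startup file; a sketch of what they would look like for the install prefix above (assumed, not quoted from the question):

export PATH=/home/$USER/.openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/$USER/.openmpi/lib:$LD_LIBRARY_PATH

One caveat worth noting: if these exports sit in ~/.bashrc after an early return for non-interactive shells (as many default .bashrc files have), they will not be visible to the non-interactive ssh session that mpirun uses to start orted on the remote node.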

The /etc/hosts file on both machines:

master x.x.x.110
slave  x.x.x.111

My ~/.ssh/config file on user1@master:

Host slave
User user2
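Since slave already resolves through /etc/hosts, no HostName line is needed; were that not the case, the block would typically be extended like this (a hypothetical variant, reusing the address from the hosts file above):

Host slave
    HostName x.x.x.111
    User user2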

I then run the following command from ~/folder on user1@master:

$ mpiexec -n 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program

I get the following error:

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
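As the first bullet of the error text suggests, one common way to make orted findable on remote nodes without relying on shell startup files is mpirun's --prefix option (or building Open MPI with --enable-orterun-prefix-by-default). A sketch of what that would look like here, assuming the install path given above:

$ mpiexec --prefix /home/user1/.openmpi -n 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program

Note that --prefix passes a single path to all nodes, so it only works cleanly when the installation prefix is identical everywhere; with per-user prefixes like /home/$USER/.openmpi, the path differs between user1 and user2, which limits this approach in this setup.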

EDIT

If I use a hostfile with the following contents:

localhost
user2@slave

and run it together with the --mca parameter, I get the following error:

$ mpirun --mca plm_base_verbose 10 -n 5 --hostfile hosts.txt ./program
[user:29277] mca: base: components_register: registering framework plm components
[user:29277] mca: base: components_register: found loaded component slurm
[user:29277] mca: base: components_register: component slurm register function successful
[user:29277] mca: base: components_register: found loaded component isolated
[user:29277] mca: base: components_register: component isolated has no register or open function
[user:29277] mca: base: components_register: found loaded component rsh
[user:29277] mca: base: components_register: component rsh register function successful
[user:29277] mca: base: components_open: opening plm components
[user:29277] mca: base: components_open: found loaded component slurm
[user:29277] mca: base: components_open: component slurm open function successful
[user:29277] mca: base: components_open: found loaded component isolated
[user:29277] mca: base: components_open: component isolated open function successful
[user:29277] mca: base: components_open: found loaded component rsh
[user:29277] mca: base: components_open: component rsh open function successful
[user:29277] mca:base:select: Auto-selecting plm components
[user:29277] mca:base:select:(  plm) Querying component [slurm]
[user:29277] mca:base:select:(  plm) Querying component [isolated]
[user:29277] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[user:29277] mca:base:select:(  plm) Querying component [rsh]
[user:29277] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[user:29277] mca:base:select:(  plm) Selected component [rsh]
[user:29277] mca: base: close: component slurm closed
[user:29277] mca: base: close: unloading component slurm
[user:29277] mca: base: close: component isolated closed
[user:29277] mca: base: close: unloading component isolated
[user:29277] *** Process received signal ***
[user:29277] Signal: Segmentation fault (11)
[user:29277] Signal code:  (128)
[user:29277] Failing at address: (nil)
[user:29277] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f4226242f20]
[user:29277] [ 1] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197)[0x7f422629b207]
[user:29277] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup_function+0x10a)[0x7f422634d06a]
[user:29277] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup+0x3d)[0x7f422634d19d]
[user:29277] [ 4] /lib/x86_64-linux-gnu/libc.so.6(getpwuid_r+0x2f3)[0x7f42262e7ee3]
[user:29277] [ 5] /lib/x86_64-linux-gnu/libc.so.6(getpwuid+0x98)[0x7f42262e7498]
[user:29277] [ 6] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x477d)[0x7f422356977d]
[user:29277] [ 7] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x67a7)[0x7f422356b7a7]
[user:29277] [ 8] /home/.openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdc9)[0x7f4226675749]
[user:29277] [ 9] mpirun(+0x1262)[0x563fde915262]
[user:29277] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f4226225b97]
[user:29277] [11] mpirun(+0xe7a)[0x563fde914e7a]
[user:29277] *** End of error message ***
Segmentation fault (core dumped)

I don't get any of the ssh/orte information I asked for, but perhaps that is because I am misusing the --mca option?
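For comparison, a plain Open MPI hostfile normally lists one host per line, optionally with a slots count; whether the user@host form above is accepted in a hostfile may depend on the Open MPI version, and the per-host login user is more commonly supplied via ~/.ssh/config instead. A sketch of the conventional form:

localhost slots=1
slave slots=1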

0 Answers:

No answers