How do I run Open MPI under Slurm

Time: 2019-03-20 07:40:49

Tags: openmpi slurm sbatch

I am unable to run Open MPI under Slurm through a Slurm script.

Normally, I can obtain the hostname and run Open MPI on my machine:

$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works

But if I do the same through the Slurm script, mpirun hostname returns an empty string, and as a consequence I cannot run mpirun -n 1 bin/ua.B.x inputua.data.

slurm-script.sh:

#!/bin/bash
#SBATCH -o slurm.out        # STDOUT
#SBATCH -e slurm.err        # STDERR
#SBATCH --mail-type=ALL

export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/ 
make ua CLASS=B 
mpirun --host myHost -n 1 bin/ua.B.x inputua.data

$ sbatch -N1 slurm-script.sh
Submitted batch job 1

The error I receive:

There are no allocated resources for the application
  bin/ua.B.x
that match the requested mapping:    
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.

A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------

2 Answers:

Answer 0 (score: 1)

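If Slurm and OpenMPI are recent versions, make sure that OpenMPI is compiled with Slurm support (run ompi_info | grep slurm to find out), and then simply run srun bin/ua.B.x inputua.data in your submission script.

Alternatively, mpirun bin/ua.B.x inputua.data should also work.

If OpenMPI is compiled without Slurm support, the following should work: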

srun hostname > output.txt
cd NPB3.3-SER/ 
make ua CLASS=B 
mpirun --hostfile output.txt -n 1 bin/ua.B.x inputua.data
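
The choice between the two launch styles described above can be sketched as a small shell helper. This is only an illustration: the function name choose_launcher is hypothetical, and the check is the same rough ompi_info | grep slurm test suggested in this answer, not a complete test of how Open MPI was built.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `ompi_info` on stdin and prints
# the launch style this answer suggests for that build.
choose_launcher() {
    if grep -qi 'slurm'; then
        # Slurm components are listed: launch directly with srun.
        echo "srun"
    else
        # No Slurm components listed: use mpirun with an explicit hostfile.
        echo "mpirun --hostfile"
    fi
}

# Intended use inside a job script (requires Open MPI on the node):
#   ompi_info | choose_launcher
```

The function only reads standard input, so it can be tried on saved ompi_info output from another machine as well.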

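Also make sure that running export LD_LIBRARY_PATH="/usr/lib/openmpi/lib" does not overwrite other library paths that are needed. Better would probably be export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/openmpi/lib" (or a more complex version if you want to avoid a leading : when the variable was initially empty).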
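
One way to write the variant that avoids a leading colon is the ${var:+...} parameter expansion, which expands to nothing when the variable is unset or empty. The sketch below wraps it in a helper; the name append_path is hypothetical and the directory is the one from the question.

```shell
#!/bin/sh
# Hypothetical helper: print "current:new", or just "new" when current is empty.
append_path() {
    current=$1
    new=$2
    # ${current:+${current}:} expands to "current:" only if current is non-empty.
    printf '%s\n' "${current:+${current}:}${new}"
}

# Example with the directory from the question:
LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" "/usr/lib/openmpi/lib")
export LD_LIBRARY_PATH
```

With an empty starting value the result is just /usr/lib/openmpi/lib, so the dynamic loader never sees an empty path entry, which it would otherwise treat as the current directory.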

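Answer 1 (score: 0)

What you need is to: 1) run mpirun, 2) from Slurm, 3) with --host. To determine what is responsible for this not working (Problem 1), you can test a few things. Whatever you test, you should run exactly the same test from the command line (CLI) and from Slurm (S). It is understood that some of these tests will give different results in the CLI and S cases.

A few notes:
1) You are not testing exactly the same things in CLI and S.
2) You say that you are "unable to run mpirun -n 1 bin/ua.B.x inputua.data", while the problem is actually with mpirun --host myHost -n 1 bin/ua.B.x inputua.data.
3) The fact that mpirun hostname > output.txt returns an empty file (Problem 2) does not necessarily have the same origin as your main problem; see the paragraph above. You can work around it by using scontrol show hostnames, or the environment variable SLURM_NODELIST (on which scontrol show hostnames is based), but this will not solve Problem 1.


To work around Problem 2, which is not the most important one, try a few things via both CLI and S. The Slurm script below may be helpful.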

#!/bin/bash
#SBATCH -o slurm_hostname.out        # STDOUT
#SBATCH -e slurm_hostname.err        # STDERR
# Append the Open MPI library directory; no leading ':' if the variable was empty
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib64/openmpi/lib"

mpirun hostname > hostname_mpirun.txt               # 1. Returns values ok for me
hostname > hostname.txt                             # 2. Returns values ok for me
hostname -s > hostname_short.txt                    # 3. Returns values ok for me
scontrol show hostnames > hostname_scontrol.txt     # 4. Returns values ok for me
echo "${SLURM_NODELIST}" > hostname_slurmcontrol.txt  # 5. Returns values ok for me

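(For an explanation of the export command, see the linked reference.) From what you say, I understand that 2, 3, 4 and 5 work for you and 1 does not. So you could now use mpirun with a suitable --host or --hostfile option.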
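
As a rough sketch of that last step, the helper below writes a hostfile from the Slurm allocation when one is available and falls back to the local host name otherwise, so the same line can be run in both CLI and S. The function name write_hostfile and the file name hosts.txt are hypothetical; SLURM_JOB_NODELIST is assumed to be set by Slurm inside a job.

```shell
#!/bin/sh
# Hypothetical helper: write one host name per line to the file given as $1.
write_hostfile() {
    out=$1
    if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
        # Inside a Slurm allocation: expand the compact node list.
        scontrol show hostnames "$SLURM_JOB_NODELIST" > "$out"
    else
        # Outside Slurm (plain command line): use the local host name.
        uname -n > "$out"
    fi
}

# Intended use (bin/ua.B.x and inputua.data are from the question):
#   write_hostfile hosts.txt
#   mpirun --hostfile hosts.txt -n 1 bin/ua.B.x inputua.data
```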

Note that the output formats differ between scontrol show hostnames (for me, e.g., cnode17<newline>cnode18) and echo ${SLURM_NODELIST} (cnode[17-18]).
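
To make the difference between the two formats concrete, here is a minimal sketch that expands the compact form into one name per line. It is not how Slurm does it: the function name expand_nodelist is hypothetical, and it handles only a single bracket group with comma-separated numbers and ranges, without zero padding. In a real job, scontrol show hostnames is the safer choice.

```shell
#!/bin/sh
# Hypothetical helper: expand a simple node list such as "cnode[17-18,20]"
# into one host name per line. Runs in a subshell so IFS changes stay local.
expand_nodelist() (
    list=$1
    case $list in
        *"["*) ;;                                # has a bracket group
        *) printf '%s\n' "$list"; exit 0 ;;      # plain host name
    esac
    prefix=${list%%\[*}     # text before '['
    body=${list#*\[}        # text after '['
    body=${body%%\]*}       # drop ']' and anything after it
    IFS=,
    for part in $body; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    printf '%s%s\n' "$prefix" "$i"
                    i=$((i + 1))
                done
                ;;
            *) printf '%s%s\n' "$prefix" "$part" ;;
        esac
    done
)

expand_nodelist 'cnode[17-18]'
```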

The host names could perhaps also be obtained from file names set dynamically with %h and %n in slurm.conf; look for, e.g., SlurmdLogFile and SlurmdPidFile.


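To diagnose, work around, or solve Problem 1, try mpirun with and without --host, in both CLI and S. From what you say, and assuming you used the correct syntax in each case, these are the outcomes: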

  1. mpirun, CLI (original post). "Works".

  2. mpirun, S (comment?). Same error as item 4 below? Note that mpirun hostname in S should have produced similar output in your slurm.err.

  3. mpirun --host, CLI (comment). Error

    There are no allocated resources for the application bin/ua.B.x that match the requested mapping:
    ...
    This may be because the daemon was unable to find all the needed shared
    libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    location of the shared libraries on the remote nodes and this will
    automatically be forwarded to the remote nodes.
    
  4. mpirun --host, S (original post). Error (same as item 3 above?)

    There are no allocated resources for the application
      bin/ua.B.x
    that match the requested mapping:    
    ------------------------------------------------------------------
    Verify that you have mapped the allocated resources properly using the
    --host or --hostfile specification.
    ...
    This may be because the daemon was unable to find all the needed shared
    libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    location of the shared libraries on the remote nodes and this will
    automatically be forwarded to the remote nodes.
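
To fill in the four cases above with exactly the same test, it can help to capture stdout, stderr, and the exit status under a label per environment. The sketch below is one possible way; the function name record_run and the labels cli and slurm are hypothetical, while the mpirun command line is the one from the question.

```shell
#!/bin/sh
# Hypothetical helper: run a command and save stdout, stderr and exit status
# in files prefixed with a label such as "cli" or "slurm".
record_run() {
    label=$1
    shift
    status=0
    "$@" > "${label}.out" 2> "${label}.err" || status=$?
    echo "$status" > "${label}.status"
}

# Intended use, once from the shell and once inside the sbatch script:
#   record_run cli   mpirun --host myHost -n 1 bin/ua.B.x inputua.data
#   record_run slurm mpirun --host myHost -n 1 bin/ua.B.x inputua.data
# then compare the results, for example: diff cli.err slurm.err
```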
    

Judging from the comments, you may have set the wrong path in LD_LIBRARY_PATH. You may also need to use mpirun --prefix ...
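
A quick way to check the first point is to look for the MPI library in each directory listed in LD_LIBRARY_PATH. The sketch below assumes the library file name starts with libmpi.so, which is an assumption about the installation; the function name find_lib_dir is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory in a colon-separated path
# list ($1) that contains a file whose name starts with $2.
# Runs in a subshell so the IFS change does not leak to the caller.
find_lib_dir() (
    IFS=:
    for dir in $1; do
        [ -n "$dir" ] || continue          # skip empty path entries
        for f in "$dir/$2"*; do
            if [ -e "$f" ]; then
                printf '%s\n' "$dir"
                exit 0
            fi
        done
    done
    exit 1                                  # not found in any directory
)

# Intended use, in both CLI and S (libmpi.so is the assumed library name):
#   find_lib_dir "${LD_LIBRARY_PATH:-}" libmpi.so || echo "libmpi.so not on LD_LIBRARY_PATH"
```

If the directory printed differs between the CLI and S runs, or nothing is found in S, that would point at the library path rather than at Slurm itself.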

Related? https://github.com/easybuilders/easybuild-easyconfigs/issues/204