I cannot run Open MPI under Slurm through a Slurm script.
In general, I can obtain the hostname and run Open MPI on my machine:
$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works
But if I do the same through the Slurm script, mpirun hostname returns an empty string, and consequently I cannot run mpirun -n 1 bin/ua.B.x inputua.data.
slurm-script.sh:
#!/bin/bash
#SBATCH -o slurm.out # STDOUT
#SBATCH -e slurm.err # STDERR
#SBATCH --mail-type=ALL
export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/
make ua CLASS=B
mpirun --host myHost -n 1 bin/ua.B.x inputua.data
$ sbatch -N1 slurm-script.sh
Submitted batch job 1
The error I get:
There are no allocated resources for the application
bin/ua.B.x
that match the requested mapping:
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------
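Answer 0 (score: 1)
If Slurm and Open MPI are recent versions, make sure that Open MPI was compiled with Slurm support (run ompi_info | grep slurm to find out) and simply run srun bin/ua.B.x inputua.data in your submission script.
Alternatively, mpirun bin/ua.B.x inputua.data should also work.
If Open MPI was compiled without Slurm support, the following should work: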
srun hostname > output.txt
cd NPB3.3-SER/
make ua CLASS=B
mpirun --hostfile output.txt -n 1 bin/ua.B.x inputua.data
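The check for Slurm support can be folded into the submission script itself. The following is only a sketch: the helper name has_slurm_support is hypothetical, and it assumes that ompi_info mentions "slurm" in its output when Slurm components were built, which is what the grep above relies on.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the text on stdin mentions "slurm".
# It reads stdin so that it can be fed the output of ompi_info.
has_slurm_support() {
    grep -qi slurm
}

if command -v ompi_info >/dev/null 2>&1 && ompi_info | has_slurm_support; then
    # Slurm-aware build: srun (or plain mpirun) can pick up the allocation.
    echo "Slurm support found: launch with srun or plain mpirun"
else
    # No Slurm components reported (or ompi_info is not on PATH).
    echo "No Slurm support detected: build a hostfile and pass --hostfile"
fi
```

The script only prints which launch method applies; the actual srun or mpirun line from the answer would go in the matching branch.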
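Also make sure that running export LD_LIBRARY_PATH="/usr/lib/openmpi/lib" does not overwrite other necessary library paths. It would be better to use export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/openmpi/lib" (or a more complex version if you want to avoid a leading : when the variable was initially empty).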
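One way to avoid the leading colon is the ${VAR:+...} parameter expansion, which expands to its argument only when the variable is set and non-empty. A minimal sketch, assuming bash or another POSIX-compatible shell; the helper name append_lib_path is hypothetical and the directory is the one from the question:

```shell
#!/bin/bash
# Hypothetical helper: append a directory to LD_LIBRARY_PATH without
# producing a leading ':' when the variable is empty or unset.
append_lib_path() {
    # ${LD_LIBRARY_PATH:+...} contributes "<old value>:" only if an old value exists.
    LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}$1"
    export LD_LIBRARY_PATH
}

append_lib_path "/usr/lib/openmpi/lib"
echo "$LD_LIBRARY_PATH"
```

Any entries that were already in LD_LIBRARY_PATH are kept in front of the new directory.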
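Answer 1 (score: 0)
What you need is to: 1) run mpirun, 2) from slurm, 3) with --host.
To determine which of these is responsible for the failure (Problem 1), you can test a few things.
Whatever you test, you should run exactly the same test both from the command line (CLI) and through slurm (S).
It is understandable that some of these tests will give different results in the CLI and S cases.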
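A few notes:
1) You are not testing exactly the same things in CLI and S.
2) You say that you are "unable to run mpirun -n 1 bin/ua.B.x inputua.data", while the problem is actually with mpirun --host myHost -n 1 bin/ua.B.x inputua.data.
3) The fact that mpirun hostname > output.txt returns an empty file (Problem 2) does not necessarily have the same origin as your main problem, see the paragraph above. You can work around it by using scontrol show hostnames or the environment variable SLURM_NODELIST (on which scontrol show hostnames is based), but this will not solve Problem 1.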
#SBATCH -o slurm_hostname.out # STDOUT
#SBATCH -e slurm_hostname.err # STDERR
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib64/openmpi/lib"
mpirun hostname > hostname_mpirun.txt # 1. Returns values ok for me
hostname > hostname.txt # 2. Returns values ok for me
hostname -s > hostname_short.txt # 3. Returns values ok for me (separate file, so that test 5 does not overwrite it)
scontrol show hostnames > hostname_scontrol.txt # 4. Returns values ok for me
echo ${SLURM_NODELIST} > hostname_slurmcontrol.txt # 5. Returns values ok for me
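(For an explanation of the export command, see this.)
From what you say, I understand that 2, 3, 4 and 5 work fine for you, and 1 does not.
So you could now use mpirun with the suitable option, --host or --hostfile.
Note that the output formats of scontrol show hostnames (for me, e.g., cnode17<newline>cnode18) and echo ${SLURM_NODELIST} (cnode[17-18]) differ.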
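Because of this difference, the compressed SLURM_NODELIST form needs expanding before it can serve as a hostfile with one host per line. A sketch, assuming it runs inside a Slurm job where scontrol is on the PATH; the helper name write_hostfile, the file name hosts.txt and the localhost fallback are hypothetical choices for illustration:

```shell
#!/bin/bash
# Hypothetical helper: write one host name per line to a file.
write_hostfile() {
    local nodelist="$1" outfile="$2"
    if command -v scontrol >/dev/null 2>&1; then
        # Expands e.g. cnode[17-18] into cnode17 and cnode18, one per line.
        scontrol show hostnames "$nodelist" > "$outfile"
    else
        # Assumption for use outside Slurm: the argument is a single plain host name.
        printf '%s\n' "$nodelist" > "$outfile"
    fi
}

# Inside a job SLURM_NODELIST is set; otherwise fall back to localhost.
write_hostfile "${SLURM_NODELIST:-localhost}" hosts.txt
cat hosts.txt
```

The resulting file could then be passed to mpirun with --hostfile hosts.txt.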
Perhaps host names could also be obtained from the file names that are set dynamically with %h and %n in slurm.conf; look for e.g. SlurmdLogFile and SlurmdPidFile.
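mpirun (with/without --host).
From what you say, and assuming you used the correct syntax in each case, this is the outcome:
1. mpirun, CLI (original post). "Works".
2. mpirun, S (comment?). Same error as in item 4 below? Note that mpirun hostname in S should have produced similar output in your slurm.err.
3. mpirun --host, CLI (comment). Error: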
There are no allocated resources for the application bin/ua.B.x that match the requested mapping:
...
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
4. mpirun --host, S (original post). Error (same as in item 3 above?):
There are no allocated resources for the application
bin/ua.B.x
that match the requested mapping:
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
...
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
According to the comments, you may have set a wrong LD_LIBRARY_PATH.
You may also need to use mpi --prefix ...
Related? https://github.com/easybuilders/easybuild-easyconfigs/issues/204