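I have installed Open MPI and Slurm on two nodes, and I want to use Slurm to run MPI jobs. When I use srun to run a non-MPI job, everything works. However, when I use salloc to run an MPI job, I get an error. My environment and code are below.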
ENV:
test.sh
#!/bin/bash
MACHINEFILE="nodes.$SLURM_JOB_ID"
# Generate Machinefile for mpich such that hosts are in the same
# order as if run via srun
#
srun -l /bin/hostname | sort -n | awk '{print $2}' > "$MACHINEFILE"
source /home/slurm/allreduce/tf/tf-allreduce/bin/activate
mpirun -np "$SLURM_NTASKS" -machinefile "$MACHINEFILE" test
rm "$MACHINEFILE"
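To make clear what the machinefile line in the script is meant to do, here is a minimal sketch of that pipeline run on sample input instead of a live allocation. The function name `hosts_from_labeled_output` and the host names `node01`/`node02` are made up for illustration; the only assumption is that `srun -l` prefixes each line with the task number followed by a colon.

```shell
#!/bin/bash
# Hypothetical helper (not part of test.sh): convert "srun -l /bin/hostname"
# style output, one "<task-id>: <hostname>" per line, into machinefile lines.
hosts_from_labeled_output() {
    # sort -n orders lines by the leading task number;
    # awk prints the second whitespace-separated field (the hostname).
    sort -n | awk '{print $2}'
}

# Sample input with made-up host names, deliberately out of task order.
printf '1: node02\n0: node01\n' | hosts_from_labeled_output
```

With this sample input the function prints node01 and then node02, one per line, so the hosts come out in task order as the comment in test.sh intends.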
Command:
salloc -N2 -n2 bash test.sh
Error:
salloc: Granted job allocation 97
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 97
Can anyone help? Thanks.