Slurm and Open MPI: An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun

Time: 2017-05-04 03:04:42

Tags: mpi openmpi slurm

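I have installed Open MPI and Slurm on two nodes, and I want to use Slurm to run MPI jobs. When I run a non-MPI job with srun, everything works. However, when I use salloc to run an MPI job, I get the error shown below. My environment and code are as follows.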

Environment:

  1. Slurm 17.02.1-2
  2. mpirun (Open MPI) 2.1.0
  3. test.sh

    #!/bin/bash
    
    MACHINEFILE="nodes.$SLURM_JOB_ID"
    
    # Generate Machinefile for mpich such that hosts are in the same
    #  order as if run via srun
    #
    srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
    
    source /home/slurm/allreduce/tf/tf-allreduce/bin/activate
    
    mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE test
    
    rm $MACHINEFILE
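
The machinefile step in test.sh can be checked on its own, outside of any allocation. Below is a minimal sketch, assuming `srun -l` prefixes each output line with the task rank followed by a colon; the host names `node1` and `node2` and the helper `simulated_srun_output` are hypothetical stand-ins for the real `srun -l /bin/hostname` output.

```shell
#!/bin/bash

# Hypothetical stand-in for `srun -l /bin/hostname`:
# each line is "<task rank>: <host name>", possibly out of order.
simulated_srun_output() {
    printf '1: node2\n0: node1\n'
}

# Same pipeline as in test.sh: sort numerically by task rank,
# then keep only the second field (the host name).
machinefile_contents=$(simulated_srun_output | sort -n | awk '{print $2}')

# Print what would be written to the machinefile.
echo "$machinefile_contents"
```

If the real machinefile does not contain one reachable host name per task in rank order, that is worth ruling out before looking at the network settings mentioned in the error message.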
    

    Command:

    salloc -N2 -n2 bash test.sh
    

    Error:

    salloc: Granted job allocation 97
    --------------------------------------------------------------------------
    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to a lack of common network interfaces and/or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).
    --------------------------------------------------------------------------
    salloc: Relinquishing job allocation 97
    

    Can anyone help? Thanks.

0 answers:

No answers yet