MPI results differ under Slurm and when run directly from the command line

Date: 2019-11-08 16:28:07

Tags: c++ mpi slurm

I ran into a problem when running my MPI project under Slurm.

a1 is my executable. When I simply run mpiexec -np 4 ./a1, it works fine.

But when I run it under Slurm, it does not work correctly and appears to stop partway through:

Here is the output from mpiexec -np 4 ./a1, which is correct:

Processor1 will send and receive with processor0
Processor3 will send and receive with processor0
Processor0 will send and receive with processor1
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor2 will send and receive with processor0
Processor1 will send and receive with processor2
Processor2 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor0 will send and receive with processor3
Processor0 finished send and receive with processor3
Processor3 finished send and receive with processor0
Processor1 finished send and receive with processor2
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor0: I am very good, I save the hash in range 0 to 65
p: 4
Tp: 8.61754
Processor1 will send and receive with processor3
Processor3 will send and receive with processor1
Processor3 finished send and receive with processor1
Processor1 finished send and receive with processor3
Processor2 will send and receive with processor3
Processor1: I am very good, I save the hash in range 65 to 130
Processor2 finished send and receive with processor3
Processor3 will send and receive with processor2
Processor3 finished send and receive with processor2
Processor3: I am very good, I save the hash in range 195 to 260
Processor2: I am very good, I save the hash in range 130 to 195
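
For reference, the "will send and receive" / "finished send and receive" lines above suggest that every rank exchanges data pairwise with every other rank. The following is only a minimal hypothetical C++ sketch of that kind of pattern (an assumption for illustration, not the actual source of a1):

// Hypothetical sketch of a pairwise all-to-all exchange (NOT the real a1).
// Each rank loops over the other ranks in increasing order and does one
// MPI_Sendrecv per peer, printing messages similar to the output above.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int send_val = rank, recv_val = -1;
    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank) continue;
        std::printf("Processor%d will send and receive with processor%d\n", rank, peer);
        // MPI_Sendrecv pairs the send and the receive in one call, so this
        // loop cannot deadlock the way two unmatched blocking MPI_Send calls can.
        MPI_Sendrecv(&send_val, 1, MPI_INT, peer, 0,
                     &recv_val, 1, MPI_INT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("Processor%d finished send and receive with processor%d\n", rank, peer);
    }

    MPI_Finalize();
    return 0;
}

The point of the sketch is that this pattern completes on its own under a correct launcher; the hang shown below therefore points at how the job was launched rather than at the exchange pattern itself.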

Here is the output under Slurm; it does not return the full result the way the direct command does:

Processor0 will send and receive with processor1
Processor2 will send and receive with processor0
Processor3 will send and receive with processor0
Processor1 will send and receive with processor0
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor2 finished send and receive with processor0
Processor1 will send and receive with processor2
Processor0 will send and receive with processor3
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor2 will send and receive with processor3
Processor1 finished send and receive with processor2

Here is my Slurm.sh file. I think I made a mistake in it, because the result differs from the one produced by the direct command, but I am not sure...

#!/bin/bash

####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute

####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=64000

####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive

####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=12:00:00

mpiexec -np 4 ./a1

1 Answer:

Answer 0 (score: 0)

Coming back to answer my own question: I made a silly mistake and used the wrong slurm.sh for my MPI code. The correct slurm.sh is:

#!/bin/bash

####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute

####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=32000

####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive

####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --time=01:00:00

####### check modules to see which version of MPI is available
####### and use appropriate module if needed
module load intel-mpi/2018.3
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

srun ./a1
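
As a quick sanity check of the Slurm + Intel MPI launch itself, a minimal MPI program can be run in place of a1 to confirm that the ranks actually start under srun. This is only a hypothetical helper, not part of the original a1:

// Hypothetical launch check (assumed helper, not part of a1): each rank
// prints its rank, the total number of ranks, and the host it runs on.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0, name_len = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);
    std::printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

If this prints one line per launched rank across the allocated nodes, the srun / PMI configuration is working, and any remaining hang comes from the application itself.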

I was being silly; that is why I use Konan as my nickname... I hope I can become smarter.