SLURM srun not running MPI jobs in parallel on Ubuntu 18

Asked: 2019-03-12 13:41:12

Tags: ubuntu-18.04 openmpi slurm

I can't seem to get Slurm to run MPI jobs in parallel. Any help or suggestions?

I'm running a small home cluster to make use of all my processors. The machines are on Ubuntu 18.04 with the stock openmpi and slurm packages installed. I have a small test program that reports which rank each process is. When I run it with mpirun I get:

$ mpirun -N 3 MPI_Hello
Process 1 on ubuntu18.davidcarter.ca, out of 3
Process 2 on ubuntu18.davidcarter.ca, out of 3
Process 0 on ubuntu18.davidcarter.ca, out of 3

When I run it with srun, I get:

$ srun -n 3 MPI_Hello
Process 0 on ubuntu18.davidcarter.ca, out of 1
Process 0 on ubuntu18.davidcarter.ca, out of 1
Process 0 on ubuntu18.davidcarter.ca, out of 1

I have tried this many times with different parameters (--mpi=pmi2, --mpi=openmpi, etc.) and can confirm that srun is running n independent single-process jobs rather than one job with n parallel processes: n times the work, each with 1/n of the expected resources.
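A useful first check (my suggestion, not something from the original question) is to ask srun which MPI plugin types it was actually built with; the comments below describe the typical symptom when Slurm and OpenMPI lack a common PMI interface, which is an assumption on my part:

```shell
# List the MPI plugin types this srun build supports.
# If srun and the stock OpenMPI share no working PMI interface,
# each launched process initializes MPI as a singleton, which would
# explain every rank reporting "Process 0 ... out of 1".
srun --mpi=list
```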

Here is my test program:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
        int numprocs, rank, namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init (&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(processor_name, &namelen);

        printf("Process %d on %s, out of %d\n", rank, processor_name, numprocs);

        MPI_Finalize();
}
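For reference, a minimal way to build and launch the program above with the OpenMPI wrapper compiler (the source file name MPI_Hello.c is my assumption; the binary name matches the invocations shown earlier):

```shell
# Build with the OpenMPI compiler wrapper, which adds the MPI
# include paths and libraries automatically.
mpicc -o MPI_Hello MPI_Hello.c

# Launch 3 ranks per node directly with mpirun, as in the
# working example above.
mpirun -N 3 ./MPI_Hello
```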

Here is my /etc/slurm-llnl/slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=ubuntu18.davidcarter.ca
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
MpiParams=ports=12000-12100
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# COMPUTE NODES
#NodeName=compute[1-4] Sockets=1 CoresPerSocket=2 RealMemory=1900 State=UNKNOWN
#NodeName=compute[1-2] Sockets=1 CoresPerSocket=4 RealMemory=3800 State=UNKNOWN
NodeName=compute1 Sockets=8 CoresPerSocket=1 RealMemory=7900 State=UNKNOWN
NodeName=ubuntu18 Sockets=1 CoresPerSocket=3 RealMemory=7900 State=UNKNOWN

#
# Partitions
PartitionName=debug Nodes=ubuntu18 Default=YES MaxTime=INFINITE State=UP OverSubscribe=NO
PartitionName=batch Nodes=compute1 Default=NO MaxTime=INFINITE State=UP OverSubscribe=NO
PartitionName=prod Nodes=compute1,ubuntu18 Default=NO MaxTime=INFINITE State=UP OverSubscribe=NO

0 Answers

There are no answers yet.