SLURM does not respect the requested resources

Time: 2018-03-28 07:08:42

Tags: ubuntu slurm

I have the following submission script, named "test.sub":

#!/bin/bash
#SBATCH --workdir=./
#SBATCH -o test.out
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue
#SBATCH --job-name=test

x=0

while [ $x -le 100 ]; do
   echo "Test $x" >> test.out
   sleep 100
   x=$(($x+1))
done

When I submit this job script, the job does start. However, when I check its status with scontrol show job, I get the following output:

...
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
...
NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,node=1

Does this mean the job is using 64 CPUs instead of the 1 specified in the job script? If so, what can I do to fix it? This is my Slurm configuration file (/etc/slurm-llnl/slurm.conf):

ControlMachine=DDHP-P1-server
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6816
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6817
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#StateSaveLocation=/var/lib/slurm-llnl/slurmctld
StateSaveLocation=/apps2/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

#ClusterName=(null) NodeName=DDHP-P1-server slurmd: Considering each NUMA node as a socket
#CPUs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257940 TmpDisk=171660

#NodeName=DDHP-P1-server CPUs=64  RealMemory=264131 State=UNKNOWN
NodeName=DDHP-P1-server CPUs=64 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=252000 State=UNKNOWN
PartitionName=debug Nodes=DDHP-P1-server Default=YES MaxTime=INFINITE State=UP

Thanks for your help! :)

1 answer:

Answer 0 (score: 2)

The problem is the line

SelectType=select/linear

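in the configuration file. It tells Slurm to allocate whole nodes to jobs, so a job on this 64-CPU node receives all 64 CPUs no matter how many tasks it requests. If you want Slurm to allocate individual cores, you need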

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

See the Slurm documentation for alternative SelectTypeParameters options.
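To confirm the change took effect, the NumCPUs field can be read back from the job record. The following is only a minimal sketch: the helper name num_cpus_from_scontrol and the JOBID placeholder are made up for illustration and are not part of Slurm. It assumes the daemons were restarted after editing slurm.conf, since a change of SelectType is generally not picked up by a plain reconfigure.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): reads `scontrol show job`
# output on stdin and prints the value of the first NumCPUs field.
num_cpus_from_scontrol() {
    grep -o 'NumCPUs=[0-9]*' | head -n 1 | cut -d= -f2
}

# Usage on a live cluster, after restarting the Slurm daemons and
# resubmitting the job (JOBID is a placeholder for the real job ID):
#   scontrol show config | grep -i '^SelectType'
#   scontrol show job "$JOBID" | num_cpus_from_scontrol
```

With the node defined as ThreadsPerCore=2, a one-task job will probably report 2 rather than 1 here, because CR_Core_Memory hands out whole cores. Memory also becomes a consumable resource with this setting, so jobs may need to request it with --mem (or the cluster may need a default) for several jobs to share the node. On newer Slurm releases the equivalent plugin is select/cons_tres.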