I have the following submission script, named "test.sub":
#!/bin/bash
#SBATCH --workdir=./
#SBATCH -o test.out
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue
#SBATCH --job-name=test
x=0
while [ $x -le 100 ]; do
echo "Test $x" >> test.out
sleep 100
x=$(($x+1))
done
When I submit this job script, the job does start. However, when I check the job's status with scontrol show job, I get the following output:
...
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
...
NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,node=1
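Does this mean the job is using 64 CPUs instead of the 1 specified in the job script? If so, what can I do to fix it? I have the following Slurm configuration file (/etc/slurm-llnl/slurm.conf):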
ControlMachine=DDHP-P1-server
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6816
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6817
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#StateSaveLocation=/var/lib/slurm-llnl/slurmctld
StateSaveLocation=/apps2/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#ClusterName=(null) NodeName=DDHP-P1-server slurmd: Considering each NUMA node as a socket
#CPUs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257940 TmpDisk=171660
#NodeName=DDHP-P1-server CPUs=64 RealMemory=264131 State=UNKNOWN
NodeName=DDHP-P1-server CPUs=64 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=252000 State=UNKNOWN
PartitionName=debug Nodes=DDHP-P1-server Default=YES MaxTime=INFINITE State=UP
Thanks for your help! :)
Answer 0 (score: 2)
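The problem is the line
SelectType=select/linear
in the configuration file. It tells Slurm to allocate whole nodes to jobs, which is why scontrol reports NumCPUs=64 for a one-task job. If you want Slurm to allocate individual cores instead, you need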
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
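To check whether the change took effect, you can resubmit the job and look at the NumCPUs field again. Below is a minimal sketch of a helper that pulls that field out of the scontrol show job output; the function name job_num_cpus and the job ID 1234 are placeholders I made up for illustration, not part of Slurm. As far as I know, a change to SelectType is only picked up after the Slurm daemons are restarted, so treat that as an assumption to verify on your system.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): read `scontrol show job` output
# on stdin and print the value of the first NumCPUs field it finds.
job_num_cpus() {
    grep -o 'NumCPUs=[0-9]*' | head -n 1 | cut -d= -f2
}

# Usage on a live cluster (1234 is a placeholder job ID):
#   scontrol show job 1234 | job_num_cpus
```

With the original select/linear setting this prints 64 for the job above. Since the node is defined with ThreadsPerCore=2, I would expect a one-task job under CR_Core_Memory to report 2 rather than 1, because a whole core is allocated, but I have not verified that on this cluster.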
For more on SelectTypeParameters