Question

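I want to run multiple jobs on a single node of my cluster. However, when I submit a job, it takes all the available CPUs, so the remaining jobs are queued. As an example, I wrote a script that requests only a few resources and submits two jobs that should run at the same time.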

    #! /bin/bash
    variable=$(seq 0 1 1)
    for l in ${variable}
    do

    run_thread="./run_thread.sh"
    cat << EOF >  ${run_thread}
    #! /bin/bash
    #SBATCH -p normal 
    #SBATCH --nodes 1 
    #SBATCH --cpus-per-task 1
    #SBATCH --ntasks 1 
    #SBATCH --threads-per-core 1
    #SBATCH --mem=10G

    sleep 120

    EOF
    sbatch ${run_thread}
    done

However, one job runs while the other stays pending:

    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       57    normal run_thre     user PD       0:00      1 (Resources)
       56    normal run_thre     user  R       0:02      1 node00

The cluster has only one node, with 4 sockets, each with 12 cores and 2 threads. The output of the command scontrol show jobid #job is as follows:

    JobId=56 JobName=run_thread.sh
       UserId=user(1002) GroupId=user(1002) MCS_label=N/A
       Priority=4294901755 Nice=0 Account=(null) QOS=(null)
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
       SubmitTime=2018-03-24T15:34:46 EligibleTime=2018-03-24T15:34:46
       StartTime=2018-03-24T15:34:46 EndTime=Unknown Deadline=N/A
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       Partition=normal AllocNode:Sid=node00:13047
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=node00
       BatchHost=node00
       NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
       TRES=cpu=48,mem=10G,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       Gres=(null) Reservation=(null)
       OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
       Command=./run_thread.sh
       WorkDir=/home/user
       StdErr=/home/user/slurm-56.out
       StdIn=/dev/null
       StdOut=/home/user/slurm-56.out
       Power=

The output of scontrol show partition is:

    PartitionName=normal
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=ALL Default=YES QoS=N/A
       DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=node00
       PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
       DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

There is something about the SLURM system that I don't get. How can I use only 1 CPU per job and run 48 jobs on the node at the same time?

Answer 1

Slurm is probably configured with

    SelectType=select/linear

which means that Slurm allocates whole nodes to jobs and does not allow jobs to share a node. That would explain why the running job shows NumCPUs=48 even though it asked for a single CPU.

You can check with

    scontrol show config | grep SelectType
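
If you want the result spelled out, the output of that command can be piped through a small helper. This is only a sketch: the function name `check_select_type` is made up for illustration, and it assumes the `Key = value` line layout that `scontrol show config` normally prints.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): reads `scontrol show config`
# output on stdin and reports whether jobs can share a node.
check_select_type() {
    local select_type
    # Match the SelectType line only (not SelectTypeParameters) and
    # strip the whitespace around the value.
    select_type=$(awk -F'=' '/^SelectType[[:space:]]*=/ { gsub(/[[:space:]]/, "", $2); print $2 }')
    case "$select_type" in
        select/linear)
            echo "select/linear: whole nodes are allocated, jobs cannot share a node" ;;
        select/cons_res|select/cons_tres)
            echo "$select_type: consumable resources, jobs can share a node" ;;
        *)
            echo "unrecognized or missing SelectType: '$select_type'" ;;
    esac
}

# Usage on a cluster:
#   scontrol show config | check_select_type
```

The helper only reads stdin, so it can also be tried on saved output from another machine.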

To allow node sharing, set SelectType to select/cons_res in slurm.conf, so that Slurm allocates individual CPUs rather than whole nodes.
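
As a rough sketch, the relevant part of slurm.conf might then look like the fragment below. The SelectTypeParameters value is an assumption and does not come from the question: CR_CPU makes each CPU the allocatable unit, while CR_Core is the usual alternative if jobs should get whole cores. Newer Slurm releases replace select/cons_res with select/cons_tres, so check which plugin your version ships.

```
# slurm.conf (fragment; values are assumptions to adapt to your site)
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
```

A change of SelectType generally requires restarting slurmctld; scontrol reconfigure alone is not expected to pick it up.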
