slurmctld reports "Purged files for defunct batch JobId" on restart

Time: 2018-12-13 20:27:14

Tags: ubuntu-18.04 slurm

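My slurmctld does not preserve the jobs that are in the queue when it exits (via ctrl+c).

I submitted about 1000 jobs, exited (ctrl+c), and on restart it reported every job (754 in this example) as defunct and purged its files: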

slurmctld: Purged files for defunct batch JobId=754

Here is the stdout on exit:

slurmctld: _job_complete: JobId=22 WEXITSTATUS 0
slurmctld: _job_complete: JobId=22 done
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: Saving all slurm state
slurmctld: layouts: all layouts are now unloaded.

Here is the stdout when restarting the service:

jonathan@jonathan-ubuntudesktop:~$ sudo slurmctld -Dcv
slurmctld: slurmctld version 18.08.3 started on cluster jonathan-inspiron-13-7378
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4
slurmctld: preempt/none loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Purged files for defunct batch JobId=1183
slurmctld: Purged files for defunct batch JobId=1023
...
slurmctld: Purged files for defunct batch JobId=1384
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
slurmctld: job_complete: invalid JobId=986
slurmctld: job_complete: invalid JobId=988
slurmctld: job_complete: invalid JobId=989
slurmctld: job_complete: invalid JobId=987

slurm.conf:

ControlAddr=192.168.1.2
AuthType=auth/munge
CryptoType=crypto/munge
MaxJobCount=1000000
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/home/jonathan/Documents/COMPANYNAME/slurmctl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/home/jonathan/Documents/COMPANYNAME/slurmctl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state/slurmd
SlurmUser=jonathan
SlurmdUser=jonathan
StateSaveLocation=/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state
SwitchType=switch/none
TaskPlugin=task/none
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
SchedulerPort=7321
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=jonathan-Inspiron-13-7378
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3
NodeName=jonathan-Inspiron-13-7378 NodeAddr=192.168.1.4 CPUs=4 State=UNKNOWN
PartitionName=Grid0 Nodes=jonathan-Inspiron-13-7378 Default=YES MaxTime=INFINITE State=UP

"/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state" is owned by jonathan:jonathan and has 750 permissions.

The Slurm 18.08.3 install was just a basic ./configure, make, and make install.

What am I doing wrong? Any help is much appreciated, thank you!

1 Answer:

Answer 0: (score: 0)

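I'm an idiot. I blindly followed the commands from a tutorial instead of reading what each flag does.

The problem was caused by the -c flag, which tells slurmctld to clear all state saved at its last checkpoint instead of recovering it. So I need to run "slurmctld -Dv" instead of "slurmctld -Dcv". Posting this on the off chance that it helps anyone else...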
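To guard against repeating the mistake, a wrapper script could check the arguments before starting the daemon. The following is only a minimal sketch: `has_clear_flag` is a hypothetical helper name, not part of Slurm, and it assumes that option arguments (such as the path given to -f) are passed as separate words rather than attached to the flag.

```shell
# Hypothetical helper (not part of Slurm): succeeds if any short-option
# cluster in the given arguments contains 'c', e.g. "-c" or "-Dcv".
# Assumption: option values are separate words ("-f FILE", not "-fFILE"),
# otherwise an attached value containing 'c' would be a false positive.
has_clear_flag() {
  for arg in "$@"; do
    case "$arg" in
      --)   break ;;      # end of options
      --*)  ;;            # long options are ignored
      -*c*) return 0 ;;   # short cluster containing c
    esac
  done
  return 1
}

# Usage sketch (commented out so the block has no side effects):
#   has_clear_flag "$@" && echo "warning: -c discards saved job state" >&2
#   slurmctld "$@"
#
# slurmctld -Dcv   # -c: clear previous state, queued jobs are purged
# slurmctld -Dv    # foreground (-D), verbose (-v), saved state recovered
```

With this check, "-Dcv" and "-D -c -v" are both flagged, while "-Dv" passes through silently.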

Cheers!