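My goal is to build a pipeline using Slurm dependencies and to handle the case where one of the Slurm jobs crashes.
Based on the following answer and section 29 of the guide, the recommendation is to use scontrol requeue $jobID, which requeues the job that was cancelled.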
If the crash can be detected from within the submission script, and crashes are random, you can simply requeue the job with
scontrol requeue $SLURM_JOB_ID
so that it runs again.
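As an illustration of the quoted advice, here is a minimal sketch of a batch script that requeues itself when its payload fails. The function name run_or_requeue and the MAX_RETRIES cap are hypothetical, introduced only for this example; SLURM_JOB_ID and SLURM_RESTART_COUNT are environment variables that Slurm sets inside a job. It assumes the cluster allows batch jobs to be requeued.

```shell
#!/bin/bash
#SBATCH --requeue            # allow this job to be requeued

# Hypothetical cap on retries, so a persistent failure does not loop forever.
MAX_RETRIES=${MAX_RETRIES:-3}

# Run the given command; if it fails and retries remain, requeue this job.
run_or_requeue() {
    "$@"
    local status=$?
    # SLURM_RESTART_COUNT is set by Slurm once a job has been requeued;
    # it is unset on the first run, so default to 0.
    local restarts=${SLURM_RESTART_COUNT:-0}
    if [ "$status" -ne 0 ] && [ "$restarts" -lt "$MAX_RETRIES" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    return "$status"
}

# Usage inside the batch script (./my_step.sh is a placeholder):
# run_or_requeue ./my_step.sh
```

The retry cap is a design choice rather than something Slurm requires: without it, a job that fails deterministically would be requeued indefinitely.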
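After I requeue the cancelled job, its dependent job stays in DependencyNeverSatisfied, and nothing happens to the dependent job even after the requeued job completes. Is there any way to update the dependent job's state when the cancelled job is requeued?
Example: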
$ sbatch run.sh
Submitted batch job 1
$ sbatch --dependency=aftercorr:1 run.sh
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (Dependency)
88 debug run.sh alper R 0:23 1 ebloc1
$ scancel 1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (DependencyNeverSatisfied)
$ scontrol requeue 1
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (DependencyNeverSatisfied)
88 debug run.sh alper R 0:00 1 ebloc1
# After the running job completes, the dependent job still remains in the DependencyNeverSatisfied state:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
89 debug run.sh alper PD 0:00 1 (DependencyNeverSatisfied)
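Answer 0 (score: 6)
After I requeue the cancelled job, its dependent job stays in DependencyNeverSatisfied, and nothing happens to the dependent job even after the requeued job completes. Is there any way to update the dependent job's state when the cancelled job is requeued?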
Yes, and it is quite simple. Use scontrol to reset the dependency:
scontrol update jobid=[dependent job id] dependency=after:[requeued job id]
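The command above can be combined with the requeue step in a small helper. This is only a sketch: the function name requeue_with_dependents and the DEP_TYPE variable are hypothetical, while scontrol requeue and scontrol update are the same commands used in this thread. DEP_TYPE defaults to after, as in the command above; note that after releases the dependent job once the requeued job has started, regardless of its exit code, so pass afterok if the dependent job should run only on success.

```shell
#!/bin/bash
# Requeue a job, then re-attach each dependent job to it.
# Usage: requeue_with_dependents <job_id> <dependent_id> [<dependent_id>...]
# DEP_TYPE is a hypothetical knob for this sketch; it defaults to "after".
requeue_with_dependents() {
    local job_id=$1
    shift
    local dep_type=${DEP_TYPE:-after}

    # Put the cancelled or failed job back in the queue.
    scontrol requeue "$job_id" || return 1

    # Reset the dependency of every dependent job so it waits again.
    local dependent
    for dependent in "$@"; do
        scontrol update jobid="$dependent" \
            dependency="${dep_type}:${job_id}" || return 1
    done
}
```

For example, `DEP_TYPE=afterok requeue_with_dependents 540912 540913` would requeue job 540912 and make job 540913 wait for it to finish successfully.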
Here is a worked example on Slurm version 17.11:
$ sbatch --begin=now+60 --wrap="exit 1"
Submitted batch job 540912
$ sbatch --dependency=afterok:540912 --wrap=hostname
Submitted batch job 540913
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall PD 0:00 1 (BeginTime)
540913 debug wrap marshall PD 0:00 1 (Dependency)
$ scancel 540912
$ scontrol requeue 540912
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall PD 0:00 1 (BeginTime)
540913 debug wrap marshall PD 0:00 1 (DependencyNeverSatisfied)
At this point I have reproduced your situation. Job 540912 has been requeued, and job 540913 has the reason "DependencyNeverSatisfied".
Now you can fix it by issuing scontrol update job:
$ scontrol update jobid=540913 dependency=after:540912
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall PD 0:00 1 (BeginTime)
540913 debug wrap marshall PD 0:00 1 (Dependency)
The state is fixed! Once the requeued job runs, the dependent job runs as well:
$ scontrol update jobid=540912 starttime=now
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
540912 debug wrap marshall CG 0:00 1 v1
540913 debug wrap marshall PD 0:00 1 (Dependency)
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
The squeue output is empty because both jobs have finished.
You can use sacct to inspect the jobs after they have finished:
$ sacct -j 540912,540913
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
540912 wrap debug test 2 FAILED 1:0
540912.batch batch test 2 FAILED 1:0
540912.exte+ extern test 2 COMPLETED 0:0
540913 wrap debug test 2 COMPLETED 0:0
540913.batch batch test 2 COMPLETED 0:0
540913.exte+ extern test 2 COMPLETED 0:0
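If the final states need to be checked from a script rather than read by eye, the sacct call above can be narrowed to one line per job. This is a minimal sketch; the function name job_states is hypothetical, and it assumes job accounting is enabled, as it is in the example above.

```shell
#!/bin/bash
# Print "<job_id> <state>" for each job given, one per line.
# -X limits output to the job allocation (no .batch/.extern steps),
# -n drops the header, -P produces pipe-separated fields.
job_states() {
    local ids
    # Join the arguments with commas, the format sacct -j expects.
    ids=$(IFS=,; echo "$*")
    sacct -X -n -P --format=JobID,State -j "$ids" | tr '|' ' '
}

# Usage: job_states 540912 540913
```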