I want to train a neural network on a cluster that uses SLURM to manage jobs. Each submitted job has a 10-hour time limit. I therefore need a script that automatically submits a chain of jobs: the first job trains from scratch, and each subsequent job reloads the latest checkpoint and continues training from there.
I wrote the script below. I would like to know whether this will work, or whether there is a standard way to handle this in SLURM.
#!/bin/bash
Njobs=1000
# Read the configuration variables
# Each training run should have a different config
CONFIG=experiments/model.cfg
source $CONFIG
# Submit first job - no dependencies
# --parsable makes sbatch print only the job ID instead of "Submitted batch job <ID>"
j0=$(sbatch --parsable run-debug.slurm $CONFIG)
echo "ID of the first job: $j0"
# add first job to the list of jobs
jIDs+=($j0)
# for loop: submit Njobs more jobs, where job i depends on job i-1
# and job i resumes from the checkpoint written by job i-1
for i in $(seq 1 $Njobs); do
    # Checkpoint written by the previous job in the chain
    RESUME_CHECKPOINT=$OUTPUTPATH/$EXPNAME/${jIDs[$i-1]}/checkpoint.pkl
    # Submit job i with a dependency ('afterok:') on the previous job,
    # so it only starts once that job has finished successfully
    new_job=$(sbatch --parsable --dependency=afterok:${jIDs[$i-1]} run-debug.slurm $CONFIG $RESUME_CHECKPOINT)
    echo "Submitted job $new_job; it will run once job ${jIDs[$i-1]} has completed successfully."
    echo "It will resume training from $RESUME_CHECKPOINT."
    jIDs+=($new_job)
    echo "List of jobs submitted so far: ${jIDs[@]}"
done
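For context, here is a simplified sketch of what run-debug.slurm could look like under this scheme, assuming it takes the config file as its first argument and an optional checkpoint path as its second; the #SBATCH resource lines and the train.py command with its --config/--resume/--checkpoint-dir flags are placeholders rather than my real training invocation:
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --time=10:00:00        # matches the 10-hour per-job limit
#SBATCH --gres=gpu:1           # placeholder resource request
#SBATCH --output=slurm-%j.out

# First argument: config file; second (optional) argument: checkpoint to resume from
CONFIG=$1
RESUME_CHECKPOINT=$2
source $CONFIG

# Each job writes its checkpoint under a directory named after its own job ID,
# which is the path the submission script reconstructs for the next job
CKPT_DIR=$OUTPUTPATH/$EXPNAME/$SLURM_JOB_ID
mkdir -p $CKPT_DIR

if [ -z "$RESUME_CHECKPOINT" ]; then
    # First job in the chain: train from scratch
    python train.py --config $CONFIG --checkpoint-dir $CKPT_DIR
else
    # Later jobs: reload the previous checkpoint and continue training
    python train.py --config $CONFIG --checkpoint-dir $CKPT_DIR --resume $RESUME_CHECKPOINT
fi
The key point is that each job saves its checkpoint under a directory named after its own $SLURM_JOB_ID, which is exactly the path the submission loop builds for the next job in the chain.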
Many thanks for your help!