I want to train a neural network on a cluster that uses SLURM to manage jobs. Each submitted job has a 10-hour time limit. I therefore need a script that automatically submits a chain of jobs: the first job trains from scratch, and each subsequent job reloads the latest checkpoint and continues training from there.
I wrote the script below. I would like to know whether this will work, or whether there is a standard way to handle this in SLURM.
#!/bin/bash
Njobs=1000
# Read the configuration variables
# Each training run should have a different config
CONFIG=experiments/model.cfg
source $CONFIG
# Submit first job - no dependencies
# --parsable makes sbatch print only the job ID instead of "Submitted batch job <ID>"
j0=$(sbatch --parsable run-debug.slurm $CONFIG)
echo "ID of the first job: $j0"
# add first job to the list of jobs
jIDs+=($j0)
# for loop: submit Njobs more jobs, where job i depends on job i-1
# and job i resumes from the checkpoint written by job i-1
for i in $(seq 1 $Njobs); do
    # Checkpoint written by the previous job in the chain
    RESUME_CHECKPOINT=$OUTPUTPATH/$EXPNAME/${jIDs[$i-1]}/checkpoint.pkl
    # Submit job i with a dependency ('afterok:') on the previous job,
    # so it only starts once that job has finished successfully
    new_job=$(sbatch --parsable --dependency=afterok:${jIDs[$i-1]} run-debug.slurm $CONFIG $RESUME_CHECKPOINT)
    echo "Submitted job $new_job; it will run once job ${jIDs[$i-1]} has completed successfully."
    echo "It will resume training from $RESUME_CHECKPOINT."
    jIDs+=($new_job)
    echo "List of jobs submitted so far: ${jIDs[@]}"
done
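For context, here is a simplified sketch of what run-debug.slurm could look like under this scheme, assuming it takes the config file as its first argument and an optional checkpoint path as its second; the #SBATCH resource lines and the train.py command with its --config/--resume/--checkpoint-dir flags are placeholders rather than my real training invocation:
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --time=10:00:00        # matches the 10-hour per-job limit
#SBATCH --gres=gpu:1           # placeholder resource request
#SBATCH --output=slurm-%j.out

# First argument: config file; second (optional) argument: checkpoint to resume from
CONFIG=$1
RESUME_CHECKPOINT=$2
source $CONFIG

# Each job writes its checkpoint under a directory named after its own job ID,
# which is the path the submission script reconstructs for the next job
CKPT_DIR=$OUTPUTPATH/$EXPNAME/$SLURM_JOB_ID
mkdir -p $CKPT_DIR

if [ -z "$RESUME_CHECKPOINT" ]; then
    # First job in the chain: train from scratch
    python train.py --config $CONFIG --checkpoint-dir $CKPT_DIR
else
    # Later jobs: reload the previous checkpoint and continue training
    python train.py --config $CONFIG --checkpoint-dir $CKPT_DIR --resume $RESUME_CHECKPOINT
fi
The key point is that each job saves its checkpoint under a directory named after its own $SLURM_JOB_ID, which is exactly the path the submission loop builds for the next job in the chain.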
Many thanks for your help!