Question

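I am running some batch programs on a (very) heterogeneous SLURM cluster (version 2.6.6-2), using GNU 'parallel' to distribute the work. The problem is that some nodes finish their tasks much faster than others, so I end up with situations where, for example, a job has 4 nodes allocated but uses only 1 of them for half of the simulation time.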

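Is there a way to release one of these unused nodes without administrator privileges? I can mitigate the problem by running the 4 jobs on a single node, or by using a file that lists nodes of the same type, but that is still far from ideal.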

For reference, here are the script files I am using (adapted from here):

job.sh

#!/bin/sh

#SBATCH --job-name=test
#SBATCH --time=96:00:00
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1024
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal

# --delay .2 prevents overloading the controlling node
# -j is the number of tasks parallel runs so we set it to $SLURM_NTASKS
# --joblog makes parallel create a log of tasks that it has already run
# --resume makes parallel use the joblog to resume from where it has left off
# the combination of --joblog and --resume allow jobs to be resubmitted if
# necessary and continue from where they left off
parallel="parallel --delay .2 -j $SLURM_NTASKS"
$parallel < command_list.sh

command_list.sh

srun --exclusive -N1 -n1 nice -19 ./a.out config0.dat
srun --exclusive -N1 -n1 nice -19 ./a.out config1.dat
srun --exclusive -N1 -n1 nice -19 ./a.out config2.dat

...

srun --exclusive -N1 -n1 nice -19 ./a.out config31.dat

Answer 1

You can use the scontrol command to shrink the job:

scontrol update JobId=# NumNodes=#

However, I am not sure how Slurm chooses which nodes to release. You may need to pick the nodes to keep yourself and write

scontrol update JobId=# NodeList=<names>
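One way to make that manual choice is to keep only the nodes that still host a running job step. The sketch below assumes that `squeue -s -j <jobid> -h -o %N` prints the node(s) of each running step and that `scontrol show hostnames` expands a compressed host list into one name per line. The helper names `busy_nodes` and `shrink_to` and the `SQUEUE`/`SCONTROL` override variables are hypothetical, introduced here only for illustration and dry runs. It has not been checked against Slurm 2.6.6.

```shell
#!/bin/sh
# Sketch only: busy_nodes/shrink_to and the SQUEUE/SCONTROL overrides are
# hypothetical helpers, not part of Slurm.

# Print the nodes of job $1 that currently host at least one job step,
# one node name per line, without duplicates.
busy_nodes() {
    ${SQUEUE:-squeue} -s -j "$1" -h -o '%N' |
    while read -r hostlist; do
        if [ -n "$hostlist" ]; then
            # Expand compressed lists such as node[01-03] into single names.
            ${SCONTROL:-scontrol} show hostnames "$hostlist"
        fi
    done | sort -u
}

# Shrink job $1 so that it keeps only the comma-separated node list $2.
shrink_to() {
    ${SCONTROL:-scontrol} update JobId="$1" NodeList="$2"
}
```

A possible call is `shrink_to "$SLURM_JOB_ID" "$(busy_nodes "$SLURM_JOB_ID" | paste -sd, -)"`. The node that runs the batch script (and therefore GNU parallel) has to stay in the list even when it has no step at that moment, which this sketch does not check, so inspect the output of `busy_nodes` before shrinking.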

See question 24 in the Slurm FAQ.
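As far as I recall, that FAQ entry also notes that after a shrink, scontrol writes a script named `slurm_job_<jobid>_resize.sh` into the current working directory, and that sourcing it resets the SLURM_* environment variables to match the smaller allocation. A minimal sketch of that sequence follows, assuming this behaviour applies to your Slurm version. The function name `shrink_and_reload` and the `SCONTROL` override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: shrink_and_reload and the SCONTROL override are hypothetical.

# Shrink job $1 to $2 nodes, then reload the Slurm environment from the
# resize script that scontrol is assumed to write into the current directory.
shrink_and_reload() {
    ${SCONTROL:-scontrol} update JobId="$1" NumNodes="$2" || return 1
    resize_script="./slurm_job_${1}_resize.sh"
    if [ -r "$resize_script" ]; then
        # Source it so the updated variables land in the current shell.
        . "$resize_script"
    fi
}
```

Inside the batch script this would be called as `shrink_and_reload "$SLURM_JOB_ID" 1`. A GNU parallel instance that is already running received its `-j` value at launch, so reloading the environment only affects commands started afterwards.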

Releasing unused allocated nodes on a SLURM cluster
