Question

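I need to convert every file in a specific directory, then compile the results into a single computation on a system that uses Slurm. The work on each individual file takes about as long as the rest of the collective computation, so I would like the individual conversions to happen simultaneously. Sequentially, this is what I need to do: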

main.sh

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

find . -maxdepth 1 -name "*.input.txt" \
  -exec ./convert-files.sh {} \;

./compile-results.sh *.output.txt

./compute.sh

echo "All Done!"

convert-files.sh

#!/bin/bash
# Simulate a time-intensive process
INPUT="$1"
# Replace "input.txt" in the file name with "output.txt"
OUTPUT="${INPUT/input.txt/output.txt}"
sleep 10
date > "$OUTPUT"
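The output name in convert-files.sh is meant to be derived from the input name with bash parameter expansion. As a minimal sketch of that derivation, the function below uses suffix removal so that only a trailing `.input.txt` is swapped; `to_output_name` is a hypothetical helper introduced for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: map "sample.input.txt" to "sample.output.txt".
to_output_name() {
  local input="$1"
  # ${input%.input.txt} removes ".input.txt" only when it is the end of the name.
  printf '%s\n' "${input%.input.txt}.output.txt"
}

to_output_name "./sample.input.txt"
```

Anchoring the match at the end of the name avoids rewriting an `input.txt` that happens to appear earlier in a path.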

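While this system works, I generally process batches of 30+ files, and the computation time exceeds the time limit set by the administrator when using only one node. How can I process the files in parallel, then compile and compute on them after they have all been completely processed?

What I've tried/considered

Adding srun to `find -exec`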

find . -maxdepth 1 -name "*.input.txt" \
  -exec srun -n1 -N1 --exclusive ./convert-files.sh {} \;

find -exec waits for blocking processes, and srun is blocking, so this takes exactly as long as the base code.
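Because srun blocks, one common way to get concurrency inside a single allocation is to start each srun step in the background and then `wait` for all of them. The following is a minimal sketch, assuming it runs inside the allocation requested by the #SBATCH header of main.sh. `convert_all` and `LAUNCHER` are hypothetical names introduced here; `LAUNCHER` exists only so the loop can be dry-run outside Slurm.

```shell
#!/bin/bash
# Assumption: called from the job script, inside the sbatch allocation.
# LAUNCHER is hypothetical; set LAUNCHER="" to run the scripts directly without Slurm.
convert_all() {
  local f
  for f in ./*.input.txt; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    # Start the step in the background so the loop does not block on it.
    ${LAUNCHER-srun -n1 -N1 --exclusive} ./convert-files.sh "$f" &
  done
  wait   # return only after every background step has exited
}
```

In the job script, `convert_all` would replace the find command, followed by the compile and compute steps as before. How many steps actually run at once is decided by Slurm from the allocation, so this may behave differently across sites and Slurm versions.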

Using sbatch in the submission script

find . -maxdepth 1 -name "*.input.txt" \
  -exec sbatch ./convert-files.sh {} \;

This does not wait for the conversions to finish before starting the computation, so the computation fails.

Using GNU parallel

find . -maxdepth 1 -name "*.input.txt" | \
  parallel ./convert-files.sh

OR

find . -maxdepth 1 -name "*.input.txt" | \
  parallel srun -n1 -N1 --exclusive ./convert-files.sh

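parallel can only "see" the number of CPUs on the current node, so it only processes four files at a time. Better, but still not what I'm looking for.

Using job arrays

This method sounds promising, but I can't figure out a way to make it work, since the files I'm processing don't have a sequential number in their names.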
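Job arrays do not strictly require numbered file names: a common workaround is to list the inputs in a stable order and let each array task pick the entry at position `$SLURM_ARRAY_TASK_ID`. The following is a minimal sketch of that idea. `nth_input` is a hypothetical helper, and it assumes every task sees the same directory contents and locale, so the sorted order is identical across tasks.

```shell
#!/bin/bash
# Hypothetical helper: print the N-th (0-based) *.input.txt file of a directory.
nth_input() {
  local index="$1" dir="${2:-.}"
  set -- "$dir"/*.input.txt        # the shell expands the glob in sorted order
  [ -e "$1" ] || return 1          # no matching files
  [ "$index" -lt "$#" ] || return 1  # index past the end of the list
  shift "$index"
  printf '%s\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside an array task; outside Slurm this is skipped.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  ./convert-files.sh "$(nth_input "$SLURM_ARRAY_TASK_ID")"
fi
```

If this were saved as, say, array-convert.sh (a made-up name) and there were 30 input files, it could be submitted with `sbatch --array=0-29 array-convert.sh`; the range has to match the number of files.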

Submitting jobs separately using sbatch

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;

Five hours later:

$ srun --account=millironx --time=30:00 --cpus-per-task=4 \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   ./compute.sh

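This is the best strategy I've come up with so far, but it means I have to remember to check on the progress of the conversion batches and start the computation once they are complete.

Using sbatch with a dependency

At the terminal: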

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;
Submitted job xxxx01
Submitted job xxxx02
...
Submitted job xxxx45
$ sbatch --account=millironx --time=30:00 --cpus-per-task=4 \
>   --dependency=after:xxxx45 --job-name=compile_results \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   --dependency=after:compile_results \
>   ./compute.sh

I haven't dared to try this, because I know that the last job submitted is not guaranteed to be the last one to finish.
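One way around that concern is to make the compile step depend on every conversion job rather than only on the last one submitted. The following is a minimal sketch, assuming sbatch's `--parsable` option (which prints the job ID instead of the usual submission message), the `afterok` dependency type, and `--wrap`. `append_id` and `submit_pipeline` are hypothetical helpers, and the account and time limits are copied from the commands above.

```shell
#!/bin/bash
# Hypothetical helper: append a job ID to a colon-separated dependency list.
append_id() {
  # $1 = current list (may be empty), $2 = new ID; drop a trailing ";cluster" part.
  printf '%s\n' "${1:+$1:}${2%%;*}"
}

# Hypothetical driver; intended to be called where sbatch is available.
submit_pipeline() {
  local deps="" f id compile_id
  for f in ./*.input.txt; do
    [ -e "$f" ] || continue
    id=$(sbatch --parsable --account=millironx --time=05:00:00 \
      --cpus-per-task=4 ./convert-files.sh "$f") || return 1
    deps=$(append_id "$deps" "$id")
  done
  [ -n "$deps" ] || return 1   # nothing was submitted
  # afterok: start only once every listed job has completed successfully.
  # --wrap defers the *.output.txt glob until the job runs, when the files exist.
  compile_id=$(sbatch --parsable --account=millironx --time=30:00 \
    --cpus-per-task=4 --dependency="afterok:$deps" \
    --wrap='./compile-results.sh ./*.output.txt') || return 1
  sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
    --dependency="afterok:${compile_id%%;*}" ./compute.sh
}
```

Note that the dependency is expressed with job IDs; the sketch does not rely on job names.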

This seems like it should be an easy thing to do, but I haven't figured it out yet.

Answer 1

If your $SLURM_NODELIST contains something similar to node1,node2,node34, then this might work:

find ... | parallel -S $SLURM_NODELIST convert_files

Answer 2

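The find . -maxdepth 1 -name "*.input.txt" | parallel srun -n1 -N1 --exclusive ./convert-files.sh approach is probably the one to follow. But it seems that ./convert-files.sh expects the file name as an argument, and you are trying to push it to stdin through the pipe. You need to use xargs, and since xargs can work in parallel, you do not need the parallel command.

Try: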

find . -maxdepth 1 -name "*.input.txt" | xargs -L1 -P$SLURM_NTASKS srun -n1 -N1 --exclusive ./convert-files.sh

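-L1 splits the output of find by line and feeds each line to convert-files.sh, spawning at most $SLURM_NTASKS processes at a time and, thanks to srun -n1 -N1 --exclusive, "sending" each of them to a CPU on the nodes allocated by Slurm.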
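As a minimal sketch, the same pipeline can be wrapped in a function and made tolerant of file names containing spaces by using null-delimited output. `convert_with_xargs`, `LAUNCHER`, and the fallback of 4 parallel slots are assumptions introduced here so the function can be tried outside Slurm; `-print0`, `-0`, `-r`, and `-P` are extensions provided by GNU find and xargs.

```shell
#!/bin/bash
# Hypothetical wrapper around the xargs pipeline.
# LAUNCHER is hypothetical; set LAUNCHER="" to run the scripts directly without Slurm.
convert_with_xargs() {
  local slots="${SLURM_NTASKS:-4}"   # assumed fallback when not inside a Slurm job
  # -print0 / -0 keep names with spaces intact; -n1 passes one name per command;
  # -r skips running the command when find produces no names.
  find . -maxdepth 1 -name "*.input.txt" -print0 |
    xargs -0 -r -n1 -P"$slots" ${LAUNCHER-srun -n1 -N1 --exclusive} ./convert-files.sh
}
```

xargs returns only after all of its child processes have exited, so the compile and compute steps can follow this call directly in the job script.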
