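I am trying to run a multi-node job with aprun. However, I cannot figure out how to get the rank (or anything else that serves as a per-instance ID) inside a bash environment. Take this simple job: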
aprun -n 8 -N 2 ./examplebashscript.sh
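How do I get the rank of each spawned instance? Without a rank or some other unique ID, this aprun line just runs the exact same program 8 times, which is not what I want.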
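I have been reading the documentation and, surprisingly, I cannot find anything that explains the default variables aprun provides.
I have worked with mpirun before and I know how to get each process's rank in C and Python programs, but not in Bash. For aprun this is not even documented.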
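Answer 0: (score: 1)
One way to do this is to write a wrapper script that takes a range of task indices and starts each task as a separate instance of your script.
From your snippet it looks like you want 2 instances of the script per compute node, for a total of 8, so in your job script you could do something like this: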
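# Launch one wrapper per compute node; each wrapper handles 2 consecutive indices.
for (( i=0; i<8; i+=2 )); do
    aprun -n 1 ./wrapper.sh "$i" 2 &
done
wait  # block until every aprun has finished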
Then in the wrapper you can do something like the following (where $j gives you the unique index):
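#!/bin/bash
# $1 = first index, $2 = number of instances to start on this node
end=$(( $1 + $2 ))
for (( j=$1; j<end; j+=1 )); do
    ./examplebashscript.sh "$j" &
done
wait  # block until all instances on this node have finished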
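You can also set the following environment variables to see where the various processes and threads are placed. You need to set them in the shell (or job script) before running "aprun":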
export MPICH_CPUMASK_DISPLAY=1
export MPICH_RANK_REORDER_DISPLAY=1
For example, running:
aprun -n 24 ./examplebashscript.sh
(which is shorthand for):
aprun -n 24 -N 24 -S 12 -d 1 ./examplebashscript.sh
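will give you output of the following kind on STDERR (note that this is on an XC30 with two 12-core Intel Ivy Bridge processors per compute node, so with hyperthreading enabled the mask covers 48 logical cores per node):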
[PE_0]: MPI rank order: Using default aprun rank ordering.
[PE_0]: rank 0 is on nid02749
[PE_0]: rank 1 is on nid02749
[PE_0]: rank 2 is on nid02749
[PE_0]: rank 3 is on nid02749
[PE_0]: rank 4 is on nid02749
[PE_0]: rank 5 is on nid02749
[PE_0]: rank 6 is on nid02749
[PE_0]: rank 7 is on nid02749
[PE_0]: rank 8 is on nid02749
[PE_0]: rank 9 is on nid02749
[PE_0]: rank 10 is on nid02749
[PE_0]: rank 11 is on nid02749
[PE_0]: rank 12 is on nid02749
[PE_0]: rank 13 is on nid02749
[PE_0]: rank 14 is on nid02749
[PE_0]: rank 15 is on nid02749
[PE_0]: rank 16 is on nid02749
[PE_0]: rank 17 is on nid02749
[PE_0]: rank 18 is on nid02749
[PE_0]: rank 19 is on nid02749
[PE_0]: rank 20 is on nid02749
[PE_0]: rank 21 is on nid02749
[PE_0]: rank 22 is on nid02749
[PE_0]: rank 23 is on nid02749
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_22]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000010000000000000000000000
[PE_21]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000001000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
[PE_20]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000100000000000000000000
[PE_9]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000001000000000
[PE_11]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000100000000000
[PE_10]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000010000000000
[PE_8]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000100000000
[PE_1]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000010
[PE_2]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000100
[PE_18]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000001000000000000000000
[PE_7]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000010000000
[PE_15]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000001000000000000000
[PE_3]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000001000
[PE_6]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000001000000
[PE_16]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000010000000000000000
[PE_14]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000100000000000000
[PE_13]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000010000000000000
[PE_12]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000001000000000000
[PE_4]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000010000
[PE_5]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000100000
[PE_17]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000100000000000000000
[PE_19]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000010000000000000000000
You may be able to capture this output in some way.
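One way to capture it, sketched below, is to redirect STDERR to a file and extract the "rank N is on nidXXXXX" lines into rank/node pairs. This assumes the launched program actually emits the display output in the format shown above; the function name `parse_rank_map` and the file name `placement.log` are illustrative, not part of aprun.

```shell
#!/bin/bash
# Sketch: turn "[PE_0]: rank N is on nidXXXXX" lines into "<rank> <node>" pairs.
# Assumption: the display output has the format shown in this answer.

parse_rank_map() {
    # Read the captured output on stdin; print one "<rank> <node>" per line.
    # Lines that do not match (e.g. the cpumask lines) are dropped.
    sed -n 's/^\[PE_[0-9][0-9]*]: rank \([0-9][0-9]*\) is on \(.*\)$/\1 \2/p'
}

# Hypothetical usage in a job script (placement.log is an illustrative name):
#   export MPICH_RANK_REORDER_DISPLAY=1
#   aprun -n 24 ./examplebashscript.sh 2> placement.log
#   parse_rank_map < placement.log
```

Note that this only recovers the placement after the fact; it does not give a running instance its own index.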
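Answer 1: (score: 1)
Try looking for the environment variable ALPS_APP_PE in the bash script that you launch with aprun.
It is different for each instance of the script (the number of instances created is given by the -n option of the aprun command).
If the script then executes an instance of an MPI program, that instance will have the MPI rank given by ALPS_APP_PE.
One caveat: some Cray sites may choose not to expose this variable, or may use a different name for it. Very old versions of ALPS do not support it either, but these are rare.
See this CUG 2014 paper for an example: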
https://cug.org/proceedings/cug2014_proceedings/includes/files/pap136.pdf
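As a minimal sketch of how the script itself could use this, the following reads ALPS_APP_PE and falls back to an explicit argument, then to 0, when the variable is not set. The helper name `get_rank` and the fallback order are assumptions made for illustration; only ALPS_APP_PE comes from the answer above.

```shell
#!/bin/bash
# Sketch of examplebashscript.sh: derive a per-instance ID from ALPS_APP_PE.
# Assumption: the site exposes ALPS_APP_PE; if it does not, fall back to
# the first command-line argument, and finally to 0.

get_rank() {
    # Prefer ALPS_APP_PE, then the first argument, then 0.
    echo "${ALPS_APP_PE:-${1:-0}}"
}

rank=$(get_rank "$@")
echo "instance ${rank} running on ${HOSTNAME:-unknown}"
```

Launched as `aprun -n 8 -N 2 ./examplebashscript.sh`, each instance would then see its own value in `$rank`, provided the variable is available on that system.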
Answer 2: (score: 0)
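Assuming you are running on a recent Cray, you can't. Your script executes on a login node, and the aprun command launches the application on the compute nodes. The application you launch can get its rank by initializing MPI and then calling MPI_Comm_rank.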