Question

I am trying to run 16 instances of an mpi4py Python script, hello.py. I stored 16 commands of this form in s.txt:

python /lustre/4_mpi4py/hello.py > 01.out

I submit this on a Cray cluster with an aprun command like this:

aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'

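My intent was to run 8 of those Python jobs per node. The script ran for more than 3 hours and no *.out files were created. The PBS scheduler output file contains this: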

Python version 2.7.3 loaded
aprun: Apid 11432669: Caught signal Terminated, sending to application
aprun: Apid 11432669: Caught signal Terminated, sending to application
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out

I am running this on a single node, which has 32 cores. I suspect I am using the GNU parallel command incorrectly. Can someone help with this?

Answer 1

As listed in https://portal.tacc.utexas.edu/documents/13601/1102030/4_mpi4py.pdf#page=8 :

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Three equivalent ways to read this process's rank and the total process count.
print("Hello! I'm rank %02d from %02d" % (comm.rank, comm.size))

print("Hello! I'm rank %02d from %02d" % (comm.Get_rank(), comm.Get_size()))

print("Hello! I'm rank %02d from %02d" % (MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()))

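Your 4_mpi4py/hello.py program is not a typical single process (or single Python script); it is a multi-process MPI application.

GNU parallel expects simpler programs and does not support interacting with MPI processes.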

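Your cluster has many nodes, and each node can start a different number of MPI processes (with two 8-core CPUs per node, consider the variants: 2 MPI processes with 8 OpenMP threads each; 1 MPI process with 16 threads; 16 MPI processes without threads). To describe your slice of the cluster to your task, there is an interface between the cluster management software and the MPI library used by the Python MPI wrapper that your script imports. The management layer is aprun (and qsub?):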

http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/aprun-man-page/

https://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/

You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

https://wickie.hlrs.de/platforms/index.php/CRAY_XE6_Using_the_Batch_System

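The job launcher for XE6 parallel jobs (both MPI and OpenMP) is aprun. ... The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes, with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system). You need to have the nodes allocated by the batch system beforehand (qsub).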

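There is an interface between aprun, qsub, and MPI: in a normal start (aprun -n 32 python /lustre/4_mpi4py/hello.py), aprun simply starts several (32) processes of your MPI program, sets the ID of each process in that interface, and gives them all a group ID (for example, through environment variables such as PMI_ID; the actual variables are specific to the launcher/MPI library combination).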
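To make the environment-variable handoff concrete, here is a minimal sketch of how a process could look up the rank its launcher assigned. The variable names other than PMI_ID are assumptions about common launchers, not something the answer states, and the helper name find_launcher_rank is hypothetical; a real MPI library does this lookup internally.

```python
import os

# Candidate names are assumptions for illustration only; the real variables
# depend on the launcher / MPI library combination.
CANDIDATE_RANK_VARS = ("PMI_ID", "PMI_RANK", "OMPI_COMM_WORLD_RANK", "ALPS_APP_PE")


def find_launcher_rank(env=None):
    """Return (variable_name, rank) for the first rank variable found, else None."""
    env = os.environ if env is None else env
    for name in CANDIDATE_RANK_VARS:
        value = env.get(name)
        # Only accept plain non-negative integers as a rank.
        if value is not None and value.isdigit():
            return name, int(value)
    return None
```

Calling find_launcher_rank() inside a process started by a launcher may show which variable, if any, carries the rank on a given system; in a plain shell it will usually return None.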

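GNU parallel has no interface to MPI programs and knows nothing about these variables. It simply starts 8 times more processes than expected. All 32 * 8 processes in your incorrect command will have the same group ID, and each MPI process ID will be shared by 8 processes. This will make your MPI library misbehave.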
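The arithmetic can be checked with a toy model. This is only a simulation under the assumption that each child started by parallel inherits its parent's environment unchanged; it launches nothing and does not reproduce real aprun or parallel behavior. The function name simulate_nested_launch is hypothetical.

```python
from collections import Counter


def simulate_nested_launch(n_ranks, jobs_per_rank):
    """Model a launcher starting n_ranks shells, each of which lets a plain
    process forker start jobs_per_rank children with the inherited environment."""
    child_ranks = []
    for rank in range(n_ranks):
        # Set once per shell by the launcher (variable name taken from the text).
        inherited_env = {"PMI_ID": str(rank)}
        for _ in range(jobs_per_rank):
            # The forker does not rewrite the variable, so every child of
            # this shell believes it is the same MPI rank.
            child_ranks.append(int(inherited_env["PMI_ID"]))
    return Counter(child_ranks)


counts = simulate_nested_launch(32, 8)
total = sum(counts.values())            # number of Python processes started
duplicates_per_rank = set(counts.values())  # how many processes share each rank
```

With 32 shells and 8 jobs each, the model yields 256 processes in which every rank ID is claimed by 8 of them, which is the collision described above.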

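Never mix MPI resource managers/launchers with ancient pre-MPI Unix process forkers such as xargs, parallel, or "very advanced bash scripting for parallelism". There is MPI for doing things in parallel, and there are MPI launchers/job managers (aprun, mpirun, mpiexec) for starting several processes, forking, or ssh-ing to machines.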

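Don't run aprun -n 32 sh -c 'parallel anything_with_MPI'; this is an unsupported combination. The only possible (allowed) argument to aprun is a program with some supported form of parallelism, such as OpenMP, MPI, or MPI+OpenMP, or a non-parallel program (or a single script that starts ONE parallel program).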

If you want to start several independent MPI tasks, pass several colon-separated segments to aprun: aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2 : -n 8 ./program_to_process_file3 : -n 8 ./program_to_process_file4
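For more than a handful of programs, such a command line can be generated rather than typed. The sketch below only builds the string, assuming aprun's colon-separated multi-program form with a colon between every segment; the helper name build_aprun_mpmd is hypothetical, and whether the segments behave as fully independent MPI jobs on a given system should be checked against the aprun man page.

```python
def build_aprun_mpmd(segments):
    """Build an aprun command line from (process_count, program) pairs.

    Segments are joined with ' : ', the separator assumed for aprun's
    multi-program form.
    """
    parts = ["-n %d %s" % (count, program) for count, program in segments]
    return "aprun " + " : ".join(parts)


# Program names follow the placeholders used in the answer.
cmd = build_aprun_mpmd(
    [(8, "./program_to_process_file%d" % i) for i in range(1, 5)]
)
```

The resulting string can be pasted into a batch script or printed for inspection before submitting.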

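If you have several files to process, try starting several parallel jobs: use not one qsub but several, and let PBS (or whichever job manager is in use) manage your jobs.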

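If you have a very large number of files, you can try not using MPI in your program at all (never link the MPI libraries or include the MPI headers) and use parallel or another form of ancient parallelism, hidden from aprun. Or use a single MPI program and implement the file distribution directly in the code (the master MPI process can open the file list and then distribute the files among the other MPI processes, with or without the dynamic process management of MPI / mpi4py: http://pythonhosted.org/mpi4py/usrman/tutorial.html#dynamic-process-management).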
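The single-MPI-program option can be sketched with mpi4py's scatter and gather. This is a minimal illustration that uses a static split rather than dynamic process management; process_file, chunk_round_robin, and run_with_mpi are hypothetical names, and the per-file work is a placeholder.

```python
def chunk_round_robin(items, n_workers):
    """Deal items out into n_workers lists, one item at a time."""
    return [items[i::n_workers] for i in range(n_workers)]


def process_file(path):
    # Hypothetical placeholder for the real per-file work.
    return "processed " + path


def run_with_mpi(file_list):
    """Distribute file_list from rank 0 to all ranks and gather the results."""
    # Imported here so the helpers above can be used and tested without MPI.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Only the master (rank 0) builds the per-rank chunks; scatter needs
    # exactly comm.size entries on the root and ignores the value elsewhere.
    chunks = chunk_round_robin(file_list, comm.size) if comm.rank == 0 else None
    my_files = comm.scatter(chunks, root=0)
    results = [process_file(path) for path in my_files]
    # Rank 0 receives a list with one result list per rank; others get None.
    return comm.gather(results, root=0)
```

A script that calls run_with_mpi on its file list would then be started once through the launcher, for example aprun -n 32 python distribute.py (the script name is hypothetical), so that aprun alone decides how many processes exist.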

Some scientists do try to combine MPI and parallel in the other order: parallel ... aprun ... or parallel ... mpirun ...:

How to use GNU parallel (bash script) with the aprun command on Cray XE6 compute nodes (Unix-like env)?
