I have a C++ solver that I need to run in parallel with the following command:
nohup mpirun -np 16 ./my_exec > log.txt &
This command runs my_exec independently on the 16 processors available on my node. It used to work perfectly.
Last week the HPC department performed an OS upgrade, and now, when launching the same command, I get two warning messages (per processor). The first one is:
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              tamnun
  Registerable memory:     32768 MiB
  Total memory:            98294 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
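This first warning concerns the InfiniBand stack rather than the code itself: only 32768 MiB of the node's 98294 MiB can be registered for RDMA. Following the FAQ item quoted above, the fix is applied by the system administrators through kernel module parameters. A minimal sketch, assuming a Mellanox mlx4 HCA (the parameter names come from that FAQ; the exact values have to be chosen for the node's actual RAM):

# /etc/modprobe.d/mlx4_core.conf (root only; values are illustrative)
# Registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page_size,
# so 2^24 * 2^1 * 4 KiB = 128 GiB, enough to cover the 98294 MiB reported above.
options mlx4_core log_num_mtt=24 log_mtts_per_seg=1
# The mlx4_core module (or the node) must be reloaded for the change to take effect.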
Then I get an output from my code telling me it believes only 1 instance of the code was launched (Nprocs = 1 instead of 16):
# MPI IS ON; Nprocs = 1
Filename = ../input/odtParam.inp

# MPI IS ON; Nprocs = 1

***** Error, process 0 failed to create ../data/data_0/, or it was already there
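In hindsight (see the edit below), Nprocs = 1 on every rank is the typical symptom of launching a binary with an mpirun from a different MPI implementation than the one it was linked against: each process then initializes as an independent singleton with a communicator size of 1. A hedged sanity check with standard tools would have exposed the mismatch:

which mpirun                 # which launcher comes first in $PATH?
mpirun --version             # which implementation does it belong to?
ldd ./my_exec | grep -i mpi  # which MPI library is the solver actually linked against?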
Finally, the second warning message is:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          tamnun (PID 17446)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
After searching online, I tried to follow the warning message's advice and set the MCA parameter mpi_warn_on_fork to 0 with the following command:
nohup mpirun --mca mpi_warn_on_fork 0 -np 16 ./my_exec > log.txt &
which produced the following error messages:
[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments
I am using RedHat 6.7 (Santiago). I have contacted the HPC department, but since I am at a university it may take them a day or two to get back to me. Any help or guidance would be much appreciated.
EDIT (answer):
It turns out I was compiling my code with Open MPI's mpic++ while launching the executable with Intel's mpirun, hence the errors (after the OS upgrade, Intel's mpirun had been made the default). I had to put the path to Open MPI's mpirun at the beginning of the $PATH environment variable.
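For reference, a minimal sketch of that fix (the Open MPI installation path below is only a placeholder; the real location depends on how the cluster's software or environment modules are set up):

export PATH=/opt/openmpi/bin:$PATH   # placeholder path: prepend Open MPI's bin directory
hash -r                              # drop the shell's cached location of mpirun
which mpirun                         # should now resolve to the Open MPI installation
mpirun --version                     # should now identify itself as Open MPI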
The code now runs as expected, but I still get the first warning message above (it no longer suggests using the MCA parameter mpi_warn_on_fork). I think (though I am not sure) this is an issue I need to resolve with the HPC department.
Answer 0 (score: 2):
[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
                                  ^^^^^
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments
                                                  ^^^^^
You are using MPICH in the latter case. MPICH is not Open MPI, and its process launcher does not recognize the Open MPI-specific --mca option (MCA stands for Modular Component Architecture, the foundational framework Open MPI is built upon). A typical case of mixing up several MPI implementations.
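A quick way to avoid this kind of mix-up is to check which implementation the launcher belongs to before submitting, and, once an Open MPI mpirun is confirmed, the MCA parameter can equivalently be passed through the environment rather than on the command line (a sketch, not specific to this cluster):

mpirun --version                       # Open MPI identifies itself as "mpirun (Open MPI) ..."; MPICH/Hydra-based launchers print a different banner
export OMPI_MCA_mpi_warn_on_fork=0     # environment form of Open MPI's --mca mpi_warn_on_fork 0
nohup mpirun -np 16 ./my_exec > log.txt &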