Segmentation fault with openMPI jobs under SLURM

Date: 2014-04-18 14:51:09

Tags: segmentation-fault cluster-computing openmpi infiniband slurm

I am having trouble launching MPI jobs over InfiniBand, either as a SLURM SBATCH job or with SRUN.

OpenMPI is installed, and the following test program (named hello) works fine when I launch it with mpirun -n 30 ./hello.

// compilation: mpicc -o helloMPI helloMPI.c
#include <mpi.h>
#include <stdio.h>
int main ( int argc, char * argv [] )
{
    int myrank, nproc;

    MPI_Init ( &argc, &argv );                  /* initialize the MPI environment */
    MPI_Comm_size ( MPI_COMM_WORLD, &nproc );   /* total number of ranks          */
    MPI_Comm_rank ( MPI_COMM_WORLD, &myrank );  /* rank of the calling process    */
    printf ( "hello from rank %d of %d\n", myrank, nproc );
    MPI_Barrier ( MPI_COMM_WORLD );
    MPI_Finalize ();
    return 0;
}

So:

user@master:~/hello$ mpicc -o hello hello.c
user@master:~/hello$ mpirun -n 30 ./hello
--------------------------------------------------------------------------
[[5627,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: master

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
hello from rank 25 of 30
hello from rank 1 of 30
hello from rank 6 of 30
[...]
hello from rank 17 of 30

When I try to launch it through SLURM, I get segmentation faults like this:

user@master:~/hello$ srun -n 20 ./hello
[node05:01937] *** Process received signal ***
[node05:01937] Signal: Segmentation fault (11)
[node05:01937] Signal code: Address not mapped (1)
[node05:01937] Failing at address: 0x30
[node05:01937] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fcf6bf7ecb0]
[node05:01937] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7fcf679b64c6]
[node05:01937] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7fcf679b74cb]
[node05:01937] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7fcf679b2141]
[node05:01937] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7fcf679a2ad0]
[node05:01937] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7fcf6c209b34]
[node05:01937] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fcf67bca652]
[node05:01937] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7fcf6c209359]
[node05:01937] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7fcf65d1b975]
[node05:01937] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7fcf6c21a0bc]
[node05:01937] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7fcf6c1cb89d]
[node05:01937] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7fcf6c1eb56b]
[node05:01937] [12] /home/user/hello/./hello[0x400826]
[node05:01937] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fcf6bbd076d]
[node05:01937] [14] /home/user/hello/./hello[0x400749]
[node05:01937] *** End of error message ***
[node05:01938] *** Process received signal ***
[node05:01938] Signal: Segmentation fault (11)
[node05:01938] Signal code: Address not mapped (1)
[node05:01938] Failing at address: 0x30
[node05:01940] *** Process received signal ***
[node05:01940] Signal: Segmentation fault (11)
[node05:01940] Signal code: Address not mapped (1)
[node05:01940] Failing at address: 0x30
[node05:01938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f68b2e10cb0]
[node05:01938] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7f68ae8484c6]
[node05:01938] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7f68ae8494cb]
[node05:01940] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f8af1d82cb0]
[node05:01940] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7f8aed7ba4c6]
[node05:01940] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7f8aed7bb4cb]
[node05:01940] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7f8aed7b6141]
[node05:01940] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7f8aed7a6ad0]
[node05:01938] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7f68ae844141]
[node05:01938] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7f68ae834ad0]
[node05:01938] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7f68b309bb34]
[node05:01938] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7f68aea5c652]
[node05:01940] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7f8af200db34]
[node05:01940] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7f8aed9ce652]
[node05:01938] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7f68b309b359]
[node05:01938] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7f68acbad975]
[node05:01940] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7f8af200d359]
[node05:01940] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7f8aebb1f975]
[node05:01940] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7f8af201e0bc]
[node05:01938] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7f68b30ac0bc]
[node05:01938] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7f68b305d89d]
[node05:01940] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7f8af1fcf89d]
[node05:01938] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7f68b307d56b]
[node05:01938] [12] /home/user/hello/./hello[0x400826]
[node05:01940] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7f8af1fef56b]
[node05:01940] [12] /home/user/hello/./hello[0x400826]
[node05:01938] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f68b2a6276d]
[node05:01938] [14] /home/user/hello/./hello[0x400749]
[node05:01938] *** End of error message ***
[node05:01940] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f8af19d476d]
[node05:01940] [14] /home/user/hello/./hello[0x400749]
[node05:01940] *** End of error message ***
[node05:01939] *** Process received signal ***
[node05:01939] Signal: Segmentation fault (11)
[node05:01939] Signal code: Address not mapped (1)
[node05:01939] Failing at address: 0x30
[...]etc

Does anyone know what the problem is? I built openMPI with Slurm support, and the same versions of the compilers and libraries are installed everywhere; in fact, all of the libraries live on an NFS share that is mounted on every node.
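
For reference, a build of Open MPI with Slurm and InfiniBand (verbs) support would typically be configured along the lines of the sketch below; the source directory, job count and install prefix are assumptions (the prefix just mirrors the paths visible in the backtraces above), not commands taken from this cluster:

# hypothetical rebuild of Open MPI 1.8 with Slurm and verbs (InfiniBand) support
cd openmpi-1.8
./configure --prefix=/opt/cluster/spool/openMPI/1.8/gcc \
            --with-slurm \
            --with-verbs
make -j4 && make install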

Clarification:

It is supposed to use InfiniBand, since it is installed. But when launching with mpirun I noticed:

[[5627,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: usNIC
  Host: cluster

which I guess means "not going over InfiniBand". I have installed the InfiniBand drivers and set up IP over InfiniBand, and Slurm is configured to run with the InfiniBand IPs: is that the correct configuration?
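
For reference, by "configured to run with the InfiniBand IPs" I mean node definitions along these lines in slurm.conf (the 10.10.0.x addresses are made up for illustration, not my real ones):

# slurm.conf excerpt -- hypothetical IPoIB addresses
NodeName=node04 NodeAddr=10.10.0.4 State=UNKNOWN
NodeName=node05 NodeAddr=10.10.0.5 State=UNKNOWN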

Thanks in advance, best regards.

Edit:

I just tried compiling it with MPICH2 instead of openMPI, and it runs fine with SLURM. So the problem is probably related to openMPI rather than to the Slurm configuration?

Edit 2: Actually, with openMPI 1.6.5 (instead of 1.8) and the SBATCH command rather than SRUN, my script is executed (i.e. it returns the thread number, rank and host). But it shows warnings related to the OpenFabrics vendor and to allocating registered memory:

The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    node05
  OMPI source:   btl_openib_component.c:1216
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node05
  Local device: mlx4_0
--------------------------------------------------------------------------
Hello world from process 025 out of 048, processor name node06
Hello world from process 030 out of 048, processor name node06
Hello world from process 032 out of 048, processor name node06
Hello world from process 046 out of 048, processor name node07
Hello world from process 031 out of 048, processor name node06
Hello world from process 041 out of 048, processor name node07
Hello world from process 034 out of 048, processor name node06
Hello world from process 044 out of 048, processor name node07
Hello world from process 033 out of 048, processor name node06
Hello world from process 045 out of 048, processor name node07
Hello world from process 026 out of 048, processor name node06
Hello world from process 043 out of 048, processor name node07
Hello world from process 024 out of 048, processor name node06
Hello world from process 038 out of 048, processor name node07
Hello world from process 014 out of 048, processor name node05
Hello world from process 027 out of 048, processor name node06
Hello world from process 036 out of 048, processor name node07
Hello world from process 019 out of 048, processor name node05
Hello world from process 028 out of 048, processor name node06
Hello world from process 040 out of 048, processor name node07
Hello world from process 023 out of 048, processor name node05
Hello world from process 042 out of 048, processor name node07
Hello world from process 018 out of 048, processor name node05
Hello world from process 039 out of 048, processor name node07
Hello world from process 021 out of 048, processor name node05
Hello world from process 047 out of 048, processor name node07
Hello world from process 037 out of 048, processor name node07
Hello world from process 015 out of 048, processor name node05
Hello world from process 035 out of 048, processor name node06
Hello world from process 020 out of 048, processor name node05
Hello world from process 029 out of 048, processor name node06
Hello world from process 016 out of 048, processor name node05
Hello world from process 017 out of 048, processor name node05
Hello world from process 022 out of 048, processor name node05
Hello world from process 012 out of 048, processor name node05
Hello world from process 013 out of 048, processor name node05
Hello world from process 000 out of 048, processor name node04
Hello world from process 001 out of 048, processor name node04
Hello world from process 002 out of 048, processor name node04
Hello world from process 003 out of 048, processor name node04
Hello world from process 006 out of 048, processor name node04
Hello world from process 009 out of 048, processor name node04
Hello world from process 011 out of 048, processor name node04
Hello world from process 004 out of 048, processor name node04
Hello world from process 007 out of 048, processor name node04
Hello world from process 008 out of 048, processor name node04
Hello world from process 010 out of 048, processor name node04
Hello world from process 005 out of 048, processor name node04
[node04:04390] 47 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[node04:04390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node04:04390] 47 more processes have sent help message help-mpi-btl-openib.txt / error in device init

What I take from this is that a) v1.6.5 has better error handling, and b) I have to configure openMPI and/or the InfiniBand drivers with a larger registered-memory size. I saw this page; apparently I only need to change things on the openMPI side? I will have to test it...
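
From the FAQ entry referenced in the warning, the usual fix seems to be raising the locked-memory limit on the compute nodes and telling Slurm not to propagate a lower limit from the submission host; a sketch of what I understand that to involve (still to be verified on my setup):

# /etc/security/limits.conf on every compute node (slurmd must be restarted afterwards)
*    soft    memlock    unlimited
*    hard    memlock    unlimited

# slurm.conf: do not propagate the (possibly lower) memlock limit of the login node
PropagateResourceLimitsExcept=MEMLOCK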

2 answers:

Answer 0 (score: 0):

Two things: for "srun ... mpi_app", you need to do something special in OMPI. See http://www.open-mpi.org/faq/?category=slurm for how to run Open MPI jobs under SLURM.
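
Roughly, what that FAQ boils down to is building Open MPI against Slurm's PMI library so that srun can launch the ranks directly; a sketch, where the PMI install path is an assumption (adjust it to wherever your Slurm provides pmi.h / libpmi):

# rebuild Open MPI with Slurm PMI support
./configure --with-slurm --with-pmi=/usr ...
# then launch directly with srun
# (depending on the Slurm configuration, srun --mpi=pmi2 may be needed)
srun -n 20 ./hello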

The usNIC message looks like a legitimate bug report that you should send to the Open MPI users mailing list:

http://www.open-mpi.org/community/lists/ompi.php

In particular, I would want to see some more details to figure out why you are getting the warning message about usNIC (I am guessing you are not running on a Cisco UCS platform with usNIC installed, but if you have IB installed, you should not be seeing that message).
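
In the meantime, if you just want to silence that warning, the usnic BTL can be excluded explicitly; this is only a workaround, not a diagnosis:

# exclude the usnic BTL component so Open MPI does not even try it
mpirun --mca btl ^usnic -n 30 ./hello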

Answer 1 (score: 0):

  1. My solution: upgrade to Slurm 14.03.2-1 and OpenMPI 1.8.1.

  2. Oddly enough, I ran into this problem (segfault in btl openib) on some of my nodes after a reorganization of the InfiniBand network. I was using Slurm 2.6.9 and OpenMPI 1.8.

  3. On the racks with Dell / AMD Opteron / Mellanox hardware it segfaults (it was working before the network reorganization).

     The racks with HP / Intel / Mellanox kept working before and after the reorganization.

     This may be related to the InfiniBand topology.