mpirun: one process terminates but does not write a core dump

Time: 2015-11-04 18:33:22

Tags: crash fortran mpi fortran90 core

Folks, I've stumbled on a very strange problem. I'm running a job with the mpirun command:

   mpirun -np 4 ~/opt/stuff/OSMC

Sometimes (the execution depends on many random values) one of the four processes dies:

  Image              PC                Routine            Line        Source             
  OSMC               000000000050B54D  Unknown               Unknown  Unknown
  OSMC               000000000050A055  Unknown               Unknown  Unknown
  OSMC               00000000004BA320  Unknown               Unknown  Unknown
  OSMC               000000000047976F  Unknown               Unknown  Unknown
  OSMC               0000000000479B72  Unknown               Unknown  Unknown
  OSMC               000000000043B7DC  mpi_m_mp_exchange         306  mpi_m.f90
  OSMC               0000000000430880  mpi_m_mp_coagulat          85  mpi_m.f90
  OSMC               000000000041304B  op_m_mp_op_run_            81  op_m.f90
  OSMC               000000000040FF22  osmc_m_mp_run_            543  OSMC_m.f90
  OSMC               000000000040FD09  MAIN__                     28  OSMC_m.f90
  OSMC               000000000040FC4C  Unknown               Unknown  Unknown
  libc.so.6          000000362081ED5D  Unknown               Unknown  Unknown
  OSMC               000000000040FB49  Unknown               Unknown  Unknown
  --------------------------------------------------------------------------
  mpirun has exited due to process rank 1 with PID 28468 on
  node rcfen04 exiting without calling "finalize". This may
  have caused other processes in the application to be
  terminated by signals sent by mpirun (as reported here).
  --------------------------------------------------------------------------

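The system does not write a core dump, so I have no information beyond this short summary. I looked at line 306 of mpi_m.f90, where an existing array is set to 0. The system should be able to write a core dump file, because: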

  [user@host path]$ ulimit -a
  core file size          (blocks, -c) unlimited
  ...
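
One thing I have not ruled out is that the limit the processes launched by mpirun actually see differs from the one in my login shell. A minimal sketch of such a check, assuming a shell whose `ulimit` accepts `-c` (bash and dash do); the function name `report_core_limit` is made up for illustration, and the commented-out mpirun line assumes `sh` is available on the compute nodes:

```shell
#!/bin/sh
# Hypothetical helper (not part of OSMC): print the core-file size limit
# that the current shell, and any process it starts, would inherit.
report_core_limit() {
    printf '%s core limit: %s\n' "$(uname -n)" "$(ulimit -c)"
}

report_core_limit

# Assumption: running the same check through mpirun shows the limit each
# rank really gets, which may differ from the login shell on remote nodes.
#   mpirun -np 4 sh -c 'echo "$(uname -n) core limit: $(ulimit -c)"'
```

If any rank reported 0 instead of unlimited, the missing core file would be explained by the limit alone, independently of the crash itself.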

This is the piece of code reported in the short summary:

  module mpi_m
    implicit none
    ...
    real(wp),allocatable :: part(:,:) ! ARRAY DECLARATION
    ...
    allocate( part_(pdim,is_:ie_) )   ! ARRAY ALLOCATION
    ...
    subroutine exchanger_compute_bij(ierr,msg)
    implicit none
    ...
    part = 0.0_wp                     ! HERE CODE CRASHES
    ...
    end subroutine
    ...
  end module

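It doesn't look wrong to me. The offending instruction is a Fortran whole-array assignment and should be fine. It crashes even when I compile with bounds checking.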

How can I determine the cause of this sudden crash? I was hoping that a core dump file, fed to Totalview or some other debugger, would have helped.

0 answers:

No answers