mpirun -np 4 ~/opt/stuff/OSMC
Sometimes (the execution depends on many random values) one of the four processes dies:
Image PC Routine Line Source
OSMC 000000000050B54D Unknown Unknown Unknown
OSMC 000000000050A055 Unknown Unknown Unknown
OSMC 00000000004BA320 Unknown Unknown Unknown
OSMC 000000000047976F Unknown Unknown Unknown
OSMC 0000000000479B72 Unknown Unknown Unknown
OSMC 000000000043B7DC mpi_m_mp_exchange 306 mpi_m.f90
OSMC 0000000000430880 mpi_m_mp_coagulat 85 mpi_m.f90
OSMC 000000000041304B op_m_mp_op_run_ 81 op_m.f90
OSMC 000000000040FF22 osmc_m_mp_run_ 543 OSMC_m.f90
OSMC 000000000040FD09 MAIN__ 28 OSMC_m.f90
OSMC 000000000040FC4C Unknown Unknown Unknown
libc.so.6 000000362081ED5D Unknown Unknown Unknown
OSMC 000000000040FB49 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 28468 on
node rcfen04 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
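The system does not write a core dump, so this short summary is all the information I have. I looked at line 306 of mpi_m.f90, where an existing array is set to 0.
The system should be able to write a core dump file, because: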
[user@host path]$ ulimit -a
core file size (blocks, -c) unlimited
...
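One thing I have not ruled out is that the limit above is the one of my login shell, which is not necessarily what the MPI ranks inherit. Below is a minimal sketch of a check, assuming Linux and bash; the script name check_core.sh is hypothetical and not part of my project:

```shell
#!/bin/bash
# check_core.sh (hypothetical name): report the core-dump settings
# seen by whichever process runs this script.

# Soft limit on core file size; "0" would mean core dumps are disabled.
core_limit=$(ulimit -c)
echo "core file size limit: ${core_limit}"

# On Linux, core_pattern controls where the kernel writes core files
# (or which program it pipes them to).
core_pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$core_pattern_file" ]; then
    core_pattern=$(cat "$core_pattern_file")
else
    core_pattern="(not readable on this system)"
fi
echo "core pattern: ${core_pattern}"
```

Launched as `mpirun -np 4 ./check_core.sh` (assuming mpirun accepts a non-MPI executable, as Open MPI does), each rank would print the settings it actually runs with, which I could compare against the output of my interactive shell.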
Here is the piece of code reported in the short summary:
module mpi_m
implicit none
...
real(wp),allocatable :: part_(:,:) ! ARRAY DECLARATION
...
allocate( part_(pdim,is_:ie_) ) ! ARRAY ALLOCATION
...
subroutine exchanger_compute_bij(ierr,msg)
implicit none
...
part_ = 0.0_wp ! HERE CODE CRASHES
...
end subroutine
...
end module
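It looks fine to me. The offending statement is a Fortran whole-array assignment, which should be safe. It crashes even when I compile with bounds checking enabled.
How can I determine the cause of this sudden crash? I had hoped that a core dump file, loaded into Totalview or some other debugger, would have helped.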