Program hangs with Open MPI

Asked: 2019-03-06 13:37:07

Tags: bash parallel-processing openmpi hpc mpich

I am running in parallel using Open MPI 2.0.2, and I always get the following message in the output file:

Warning: Permanently added the RSA host key for IP address '10.4.12.75' to the list of known hosts.^M
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           hpc488
  Local device:         mlx4_0
  Local port:           2
  CPCs attempted:       rdmacm, udcm
----------------------------------------

followed by these messages:

[hpc488:45221] 39 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[hpc488:45221] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
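
The log above names one MCA parameter and one port that could not be used. A minimal sketch of how such parameters might be passed to mpirun follows. The `orte_base_help_aggregate` name comes from the log itself. The `btl_openib_if_include mlx4_0:1` setting is an assumption: it presumes that port 1 of mlx4_0 is the active InfiniBand port, which the question does not state. The sketch only prints the command, and `<program and arguments>` is a placeholder.

```shell
#!/bin/sh
# Sketch only: prints the command line instead of launching a job.

# 1) Show every help message instead of the aggregated summary
#    (parameter name taken from the log above).
verbose_opts="--mca orte_base_help_aggregate 0"

# 2) ASSUMPTION: port 1 of mlx4_0 is the active InfiniBand port. Restricting
#    the openib BTL to it keeps the unusable port 2 from being probed.
port_opts="--mca btl_openib_if_include mlx4_0:1"

# The same parameter can be set through the environment, because Open MPI
# reads variables named OMPI_MCA_<parameter>.
export OMPI_MCA_orte_base_help_aggregate=0

echo "mpirun $verbose_opts $port_opts -machinefile \$PBS_NODEFILE -np 80 <program and arguments>"
```

If the warning is only about an unconnected second port, it is probably harmless on its own and may be unrelated to the hang.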

The script I am using is as follows:

#!/bin/bash
#PBS -N test
#PBS -q medium
#PBS -l nodes=4:ppn=20
cd $PBS_O_WORKDIR
export I_MPI_FABRICS shm:dapl
export I_MPI_MPD_TMPDIR /scratch/$USER
mpirun -machinefile $PBS_NODEFILE -np 80 ~/test/cp2k-5.1.0/exe/local/cp2k.popt -i ATP-1.restart >& out
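
A note on the two `export` lines in the script above: bash expects `NAME=value`, so `export NAME value` treats the value as a second variable name and typically fails with "not a valid identifier", leaving the variable unset. The `I_MPI_*` names are also Intel MPI settings, so Open MPI is not expected to act on them in any case. A minimal sketch of the assignment syntax, reusing the same names and values from the script:

```shell
#!/bin/sh
# NAME=value is the form the shell accepts; the names and values below are
# copied from the job script in the question.
export I_MPI_FABRICS=shm:dapl
export I_MPI_MPD_TMPDIR="/scratch/$USER"

# Print the values to confirm they were actually set.
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
echo "I_MPI_MPD_TMPDIR=$I_MPI_MPD_TMPDIR"
```

Whether these variables should be set at all depends on which MPI library cp2k.popt was built against, which the question does not say.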

The program also hangs after a while, for example:

Output at the beginning:

 ----------------------------------- OT ---------------------------------------

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 OT DIIS     0.80E-01   54.3     0.00002715     -8803.0497995708 -8.80E+03
     2 OT DIIS     0.80E-01   18.8     0.00005469     -8803.0494664995  3.33E-04
     3 OT DIIS     0.80E-01   19.0     0.00001678     -8803.0507564351 -1.29E-03
     4 OT DIIS     0.80E-01   18.9     0.00001380     -8803.0508931318 -1.37E-04
     5 OT DIIS     0.80E-01   19.0     0.00000619     -8803.0510930570 -2.00E-04

  *** SCF run converged in     5 steps ***

Output after some time:

 ----------------------------------- OT ---------------------------------------

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 OT DIIS     0.80E-01  543.5     0.00005264     -8803.0309338155 -8.80E+03
     2 OT DIIS     0.80E-01  129.1     0.00017122     -8803.0214844607  9.45E-03
     3 OT DIIS     0.80E-01   97.0     0.00001549     -8803.0324199550 -1.09E-02
     4 OT DIIS     0.80E-01  104.3     0.00001280     -8803.0325293227 -1.09E-04
     5 OT DIIS     0.80E-01  108.0     0.00000682     -8803.0327023147 -1.73E-04

  *** SCF run converged in     5 steps ***
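
Comparing the two outputs, the Time column grows from roughly 19 per step to roughly 100 per step, so the job may be slowing down rather than stopping outright. A minimal sketch for pulling the step number and Time column out of the log, so the trend can be followed over a run, is shown below. The sample lines are copied from the outputs above. In practice awk would be pointed at the job's `out` file instead of the temporary file used here.

```shell
#!/bin/sh
# Two sample lines copied from the outputs above (step 5, early and late).
sample=$(mktemp)
cat > "$sample" <<'EOF'
     5 OT DIIS     0.80E-01   19.0     0.00000619     -8803.0510930570 -2.00E-04
     5 OT DIIS     0.80E-01  108.0     0.00000682     -8803.0327023147 -1.73E-04
EOF

# Field 1 is the SCF step, field 5 is the Time column.
step_times=$(awk '$2 == "OT" && $3 == "DIIS" { print $1, $5 }' "$sample")
echo "$step_times"

rm -f "$sample"
```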

Does anyone know what is going on? I would greatly appreciate your help.

0 answers:

No answers