I am running CP2K in parallel with Open MPI 2.0.2, and I always get the following message in the output file:
Warning: Permanently added the RSA host key for IP address '10.4.12.75' to the list of known hosts.^M
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: hpc488
Local device: mlx4_0
Local port: 2
CPCs attempted: rdmacm, udcm
----------------------------------------
followed by these error messages:
[hpc488:45221] 39 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[hpc488:45221] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
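The last message asks for the MCA parameter "orte_base_help_aggregate" to be set to 0. A minimal sketch of one way to do that before the mpirun line in the job script, assuming Open MPI's convention of reading MCA parameters from environment variables named OMPI_MCA_<parameter>:

```shell
# Assumption: Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
# Setting this to 0 turns off aggregation, so every process prints its own help message.
export OMPI_MCA_orte_base_help_aggregate=0

# Confirm the value that mpirun will inherit from the environment.
echo "orte_base_help_aggregate=$OMPI_MCA_orte_base_help_aggregate"
```

The same parameter can presumably also be passed on the command line as `mpirun --mca orte_base_help_aggregate 0 ...`. This only makes all 40 help messages visible; it does not change which transport is used.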
The script I am using is as follows:
#!/bin/bash
#PBS -N test
#PBS -q medium
#PBS -l nodes=4:ppn=20
cd $PBS_O_WORKDIR
export I_MPI_FABRICS=shm:dapl
export I_MPI_MPD_TMPDIR=/scratch/$USER
mpirun -machinefile $PBS_NODEFILE -np 80 ~/test/cp2k-5.1.0/exe/local/cp2k.popt -i ATP-1.restart >& out
The program also hangs after running for a while. For example:
Output at the beginning:
----------------------------------- OT ---------------------------------------
Step Update method Time Convergence Total energy Change
------------------------------------------------------------------------------
1 OT DIIS 0.80E-01 54.3 0.00002715 -8803.0497995708 -8.80E+03
2 OT DIIS 0.80E-01 18.8 0.00005469 -8803.0494664995 3.33E-04
3 OT DIIS 0.80E-01 19.0 0.00001678 -8803.0507564351 -1.29E-03
4 OT DIIS 0.80E-01 18.9 0.00001380 -8803.0508931318 -1.37E-04
5 OT DIIS 0.80E-01 19.0 0.00000619 -8803.0510930570 -2.00E-04
*** SCF run converged in 5 steps ***
Output after some time:
----------------------------------- OT ---------------------------------------
Step Update method Time Convergence Total energy Change
------------------------------------------------------------------------------
1 OT DIIS 0.80E-01 543.5 0.00005264 -8803.0309338155 -8.80E+03
2 OT DIIS 0.80E-01 129.1 0.00017122 -8803.0214844607 9.45E-03
3 OT DIIS 0.80E-01 97.0 0.00001549 -8803.0324199550 -1.09E-02
4 OT DIIS 0.80E-01 104.3 0.00001280 -8803.0325293227 -1.09E-04
5 OT DIIS 0.80E-01 108.0 0.00000682 -8803.0327023147 -1.73E-04
*** SCF run converged in 5 steps ***
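To compare the per-step times in the two outputs above, the Time column can be pulled out of the OT lines. This is only a rough sketch: the helper name `ot_times` is made up for illustration, and it assumes every step line has the same field layout as the lines shown here (step, "OT", "DIIS", step size, time, ...).

```shell
# Hypothetical helper: print the step number and the per-step time (5th field)
# for every "OT DIIS" line read from standard input.
ot_times() {
    awk '$2 == "OT" && $3 == "DIIS" { print $1, $5 }'
}

# Two lines copied from the outputs above: step 5 at the start and step 5 later on.
printf '%s\n' \
    '5 OT DIIS 0.80E-01 19.0 0.00000619 -8803.0510930570 -2.00E-04' \
    '5 OT DIIS 0.80E-01 108.0 0.00000682 -8803.0327023147 -1.73E-04' | ot_times
```

Against the real output file from the job script this would be `ot_times < out`. In the excerpts above, the time per step grows from about 19 to roughly 100 or more.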
Does anyone know what is going on? I would greatly appreciate your help.