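I compiled OpenMPI with Java support locally, following https://www.open-mpi.org/faq/?category=java. On my local machine with Oracle Java 8 this works fine, but on a cluster with OpenJDK 8 it causes MPI.Init to hang. Do you have any pointers on how to proceed from here? Dtrace? Trying other Java versions? I cannot find any documentation on which Java versions this interface supports.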
package com.acme.hello;

import mpi.*;

public class HelloMpi {
    public static void main(String[] args) throws Exception {
        System.out.println("attempting MPI init");
        // Initialize MPI; this is the call that hangs on the cluster.
        args = MPI.Init(args);
        System.out.println("MPI init done");
        // Shut MPI down cleanly before the JVM exits.
        MPI.Finalize();
    }
}
> java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
> ~/NQSIM/java$ mpirun -version
mpirun (Open MPI) 3.1.2
> ~/NQSIM/java$ mpirun -np 2 java -classpath "./target/test-classes/" com.acme.hello.HelloMpi
attempting MPI init
attempting MPI init
(hangs here forever)
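Edit: examples/hello_c shows the same behavior, so this is not related to Java. I guess it must be something in the transport. I can only build and install OpenMPI with user privileges. There is an existing OpenMPI on the system, but it has no Java support. Any ideas on how to proceed?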
Edit2: Switching to a different byte transfer layer (BTL), e.g. with --mca btl vader,self, works fine. Below is the output of --mca btl_base_verbose up to the point where it hangs:
[fdr4:33013] mca: base: components_register: registering framework btl components
[fdr4:33013] mca: base: components_register: found loaded component sm
[fdr4:33014] mca: base: components_register: registering framework btl components
[fdr4:33014] mca: base: components_register: found loaded component sm
[fdr4:33013] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: found loaded component self
[fdr4:33014] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: component self register function successful
[fdr4:33014] mca: base: components_register: found loaded component self
[fdr4:33013] mca: base: components_register: found loaded component tcp
[fdr4:33014] mca: base: components_register: component self register function successful
[fdr4:33013] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component tcp
[fdr4:33013] mca: base: components_register: found loaded component vader
[fdr4:33013] mca: base: components_register: component vader register function successful
[fdr4:33013] mca: base: components_register: found loaded component openib
[fdr4:33014] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component vader
[fdr4:33014] mca: base: components_register: component vader register function successful
[fdr4:33014] mca: base: components_register: found loaded component openib
[fdr4:33013] mca: base: components_register: component openib register function successful
[fdr4:33013] mca: base: components_open: opening btl components
[fdr4:33013] mca: base: components_open: found loaded component sm
[fdr4:33013] mca: base: components_open: component sm open function successful
[fdr4:33013] mca: base: components_open: found loaded component self
[fdr4:33013] mca: base: components_open: component self open function successful
[fdr4:33013] mca: base: components_open: found loaded component tcp
[fdr4:33013] mca: base: components_open: component tcp open function successful
[fdr4:33013] mca: base: components_open: found loaded component vader
[fdr4:33013] mca: base: components_open: component vader open function successful
[fdr4:33013] mca: base: components_open: found loaded component openib
[fdr4:33013] mca: base: components_open: component openib open function successful
[fdr4:33013] select: initializing btl component sm
[fdr4:33014] mca: base: components_register: component openib register function successful
[fdr4:33014] mca: base: components_open: opening btl components
[fdr4:33014] mca: base: components_open: found loaded component sm
[fdr4:33014] mca: base: components_open: component sm open function successful
[fdr4:33014] mca: base: components_open: found loaded component self
[fdr4:33014] mca: base: components_open: component self open function successful
[fdr4:33014] mca: base: components_open: found loaded component tcp
[fdr4:33014] mca: base: components_open: component tcp open function successful
[fdr4:33014] mca: base: components_open: found loaded component vader
[fdr4:33014] mca: base: components_open: component vader open function successful
[fdr4:33014] mca: base: components_open: found loaded component openib
[fdr4:33014] mca: base: components_open: component openib open function successful
[fdr4:33014] select: initializing btl component sm
[fdr4:33014] select: init of component sm returned success
[fdr4:33014] select: initializing btl component self
[fdr4:33014] select: init of component self returned success
[fdr4:33014] select: initializing btl component tcp
[fdr4:33013] select: init of component sm returned success
[fdr4:33013] select: initializing btl component self
[fdr4:33013] select: init of component self returned success
[fdr4:33013] select: initializing btl component tcp
[fdr4:33014] select: init of component tcp returned success
[fdr4:33014] select: initializing btl component vader
[fdr4:33013] select: init of component tcp returned success
[fdr4:33013] select: initializing btl component vader
[fdr4:33014] select: init of component vader returned success
[fdr4:33014] select: initializing btl component openib
[fdr4:33013] select: init of component vader returned success
[fdr4:33013] select: initializing btl component openib
[fdr4:33014] Checking distance from this process to device=mlx4_0
[fdr4:33013] Checking distance from this process to device=mlx4_0
[fdr4:33013] hwloc_distances->nbobjs=4
[fdr4:33013] hwloc_distances->latency[0]=1.000000
[fdr4:33013] hwloc_distances->latency[1]=2.000000
[fdr4:33013] hwloc_distances->latency[2]=3.000000
[fdr4:33014] hwloc_distances->nbobjs=4
[fdr4:33014] hwloc_distances->latency[0]=1.000000
[fdr4:33014] hwloc_distances->latency[1]=2.000000
[fdr4:33014] hwloc_distances->latency[2]=3.000000
[fdr4:33013] hwloc_distances->latency[3]=2.000000
[fdr4:33013] hwloc_distances->latency[4]=2.000000
[fdr4:33013] hwloc_distances->latency[5]=1.000000
[fdr4:33013] hwloc_distances->latency[6]=2.000000
[fdr4:33013] hwloc_distances->latency[7]=3.000000
[fdr4:33013] ibv_obj->logical_index=1
[fdr4:33014] hwloc_distances->latency[3]=2.000000
[fdr4:33014] hwloc_distances->latency[4]=2.000000
[fdr4:33014] hwloc_distances->latency[5]=1.000000
[fdr4:33014] hwloc_distances->latency[6]=2.000000
[fdr4:33014] hwloc_distances->latency[7]=3.000000
[fdr4:33014] ibv_obj->logical_index=1
[fdr4:33013] my_obj->logical_index=0
[fdr4:33013] Process is bound: distance to device is 2.000000
[fdr4:33014] my_obj->logical_index=0
[fdr4:33014] Process is bound: distance to device is 2.000000
[fdr4:33013] [rank=0] openib: using port mlx4_0:1
[fdr4:33013] select: init of component openib returned success
[fdr4:33014] [rank=1] openib: using port mlx4_0:1
[fdr4:33014] select: init of component openib returned success
[fdr4:33013] mca: bml: Using self btl for send to [[59315,1],0] on node fdr4
[fdr4:33014] mca: bml: Using self btl for send to [[59315,1],1] on node fdr4
[fdr4:33013] mca: bml: Using vader btl for send to [[59315,1],1] on node fdr4
[fdr4:33014] mca: bml: Using vader btl for send to [[59315,1],0] on node fdr4
Answer 0 (score: 0)
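Solved. In this case the problem was one of the limits imposed on users. The server had been set up with the default settings, but after the following was changed in /etc/security/limits.conf, it started working with the default byte transfer layer. Unfortunately I could not test this directly myself, so I do not know which of the two settings was the culprit: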
* - memlock unlimited
* - nofile 16384
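
To see which limits a launched process actually gets, something like the following might help. This is only a sketch and not part of the original fix: it assumes a Linux node where /proc/self/limits is readable, and the class name ShowLimits and the helper softLimits are made up for illustration. It prints the soft limits for locked memory and open files, the two values touched by the limits.conf change above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Open MPI: reports the soft limits the
// current process runs with, as listed in /proc/self/limits (Linux only).
public class ShowLimits {

    // Returns the soft limit of each named row from /proc/<pid>/limits output.
    static Map<String, String> softLimits(List<String> lines, String... names) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            for (String name : names) {
                if (line.startsWith(name)) {
                    // Columns after the name: soft limit, hard limit, units.
                    String[] cols = line.substring(name.length()).trim().split("\\s+");
                    result.put(name, cols[0]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/self/limits"));
        softLimits(lines, "Max locked memory", "Max open files")
            .forEach((name, soft) -> System.out.println(name + ": " + soft));
    }
}
```

Running it under the launcher, for example mpirun -np 2 java ShowLimits, should show the limits as each rank sees them, which can differ from those of an interactive shell. Note that /proc reports locked memory in bytes, whereas memlock in limits.conf is given in KB. Changing one setting at a time and re-running would also show which of the two matters.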