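I'm new to Open MPI and trying to figure it out. I want to be able to launch processes later that connect, via ompi-server, to a previously launched process that may be running on another node, but I keep getting an error from the client. After several hours of searching for an answer, I'm finally asking.
First, I start ompi-server.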
ompi-server --no-daemonize -d -r +
[kurenai:15711] procdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0/0
[kurenai:15711] jobdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0
[kurenai:15711] top: openmpi-sessions-barronj@kurenai_0
[kurenai:15711] tmp: /tmp
[kurenai:15711] sess_dir_cleanup: job session dir does not exist
[kurenai:15711] procdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0/0
[kurenai:15711] jobdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0
[kurenai:15711] top: openmpi-sessions-barronj@kurenai_0
[kurenai:15711] tmp: /tmp
1968111616.0;tcp://192.168.1.219:55602
[kurenai:15711] [[30031,0],0] orte-server: up and running!
After that, I run the server.
mpirun -np 1 --hostfile ~/mpi-hosts --ompi-server "1968111616.0;tcp://192.168.1.219:55602" /home/barronj/ompi_test/port_server
port = 1982005248.0;tcp://192.168.1.219:38916+1982005249.0;tcp://192.168.1.219:41605:300
Here is the relevant code that the server runs.
// Note: these catch blocks only run if the error handler has been set to
// MPI::ERRORS_THROW_EXCEPTIONS; the default handler aborts instead.
try {
    MPI::Open_port(MPI::INFO_NULL, port);
} catch (const MPI::Exception& e) {
    fprintf(stderr, "Server open port error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Finalize();
    return EXIT_FAILURE;
}
// Publish the port under a service name, visible through ompi-server.
MPI::Info info = MPI::Info::Create();
info.Set("ompi_global_scope", "true");
try {
    MPI::Publish_name("test_service", info, port);
} catch (const MPI::Exception& e) {
    fprintf(stderr, "Server service publish error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    info.Free();
    MPI::Close_port(port);
    MPI::Finalize();
    return EXIT_FAILURE;
}
info.Free();
printf("port = %s\n", port);
// Block until a client connects to the published port.
try {
    intercomm = MPI::COMM_SELF.Accept(port, MPI::INFO_NULL, 0);
} catch (const MPI::Exception& e) {
    fprintf(stderr, "Server accept error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Unpublish_name("test_service", MPI::INFO_NULL, port);
    MPI::Close_port(port);
    MPI::Finalize();
    return EXIT_FAILURE;
}
On another node, I run the client and get an error.
mpirun -np 1 --hostfile ~/mpi-hosts --ompi-server "1968111616.0;tcp://192.168.1.219:55602" /home/barronj/ompi_test/port_client
barronj@kurenai's password:
Client found test_service on port, 1982005248.0;tcp://192.168.1.219:38916+1982005249.0;tcp://192.168.1.219:41605:300
[athena:07039] [[28058,0],0]-[[30243,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 19
As you can see, the service and its port are found, but connecting results in an error. Here is the relevant client code.
// Look up the port that the server published under "test_service".
try {
    MPI::Lookup_name("test_service", MPI::INFO_NULL, port);
} catch (const MPI::Exception& e) {
    fprintf(stderr, "Service lookup error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Finalize();
    return EXIT_FAILURE;
}
printf("Client found test_service on port, %s\n", port);
// Connect to the server's port; this is the call that fails.
try {
    intercomm = MPI::COMM_SELF.Connect(port, MPI::INFO_NULL, 0);
} catch (const MPI::Exception& e) {
    fprintf(stderr, "Client connect error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Finalize();
    return EXIT_FAILURE;
}
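While reading the logs above, it may help to relate the numeric prefixes in the URIs to the bracketed process names. Below is a minimal sketch, assuming the number before the dot is a job id whose upper 16 bits are the job family shown in brackets. That assumption is consistent with the output above (1968111616 = 30031 × 65536, matching `[[30031,0],0]`), but I have not confirmed it against the Open MPI sources. `ProcName`, `parse_proc_name`, and `format_name` are hypothetical helpers, not Open MPI APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical helper type (not part of Open MPI).
struct ProcName {
    std::uint32_t job_family;  // assumed: upper 16 bits of the numeric job id
    std::uint32_t local_job;   // assumed: lower 16 bits of the numeric job id
    std::uint32_t vpid;        // number after the dot
};

// Parse the "<jobid>.<vpid>" prefix of a URI such as
// "1968111616.0;tcp://192.168.1.219:55602".
// Returns false if the prefix does not have that shape.
bool parse_proc_name(const std::string& uri, ProcName& out) {
    const std::size_t dot = uri.find('.');
    const std::size_t semi = uri.find(';');
    if (dot == std::string::npos || semi == std::string::npos ||
        dot == 0 || dot + 1 >= semi) {
        return false;
    }
    try {
        const std::uint32_t jobid =
            static_cast<std::uint32_t>(std::stoul(uri.substr(0, dot)));
        const std::uint32_t vpid =
            static_cast<std::uint32_t>(std::stoul(uri.substr(dot + 1, semi - dot - 1)));
        out.job_family = (jobid >> 16) & 0xFFFFu;
        out.local_job = jobid & 0xFFFFu;
        out.vpid = vpid;
        return true;
    } catch (const std::exception&) {
        return false;  // non-numeric or out-of-range fields
    }
}

// Render the name in the bracketed style used by the log lines.
std::string format_name(const ProcName& n) {
    return "[[" + std::to_string(n.job_family) + "," +
           std::to_string(n.local_job) + "]," + std::to_string(n.vpid) + "]";
}
```

Under that assumption, the `1982005248.0` at the start of the published port string maps to `[[30243,0],0]`, which is the second name in the client's error line, and `1982005249.0` maps to `[[30243,1],0]`.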
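Since I'm new to this, I don't fully understand it all yet. I tried using MPI::COMM_WORLD instead of MPI::COMM_SELF, but that does not fix it.
I'm not sure whether this is relevant, but I also tried adding the --wait-for-server option.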
mpirun -np 1 --hostfile ~/mpi-hosts --ompi-server "1968111616.0;tcp://192.168.1.219:55602" --wait-for-server /home/barronj/ompi_test/port_client
--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:
Server uri: 1968111616.0;tcp://192.168.1.219:55602
Timeout time: 10
Error received: Not supported
Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------
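Adding this option to the server's mpirun command does the same thing.
I also tried giving --ompi-server a file instead of copying and pasting the URI. That just produces the same problem.
Any help is appreciated. Thanks.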