Error connecting independently launched processes using ompi-server

Time: 2014-08-05 23:39:25

Tags: c++ mpi openmpi

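I'm new to Open MPI and am trying to figure it out. I want to be able to launch processes later that connect, through ompi-server, to previously launched processes that may be running on another node, but I keep getting an error from the client. After several hours of searching for an answer, I'm finally asking.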

ompi-server --no-daemonize -d -r +
[kurenai:15711] procdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0/0
[kurenai:15711] jobdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0
[kurenai:15711] top: openmpi-sessions-barronj@kurenai_0
[kurenai:15711] tmp: /tmp
[kurenai:15711] sess_dir_cleanup: job session dir does not exist
[kurenai:15711] procdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0/0
[kurenai:15711] jobdir: /tmp/openmpi-sessions-barronj@kurenai_0/30031/0
[kurenai:15711] top: openmpi-sessions-barronj@kurenai_0
[kurenai:15711] tmp: /tmp
1968111616.0;tcp://192.168.1.219:55602
[kurenai:15711] [[30031,0],0] orte-server: up and running!

After that, I run the server.

mpirun -np 1 --hostfile ~/mpi-hosts --ompi-server "1968111616.0;tcp://192.168.1.219:55602" /home/barronj/ompi_test/port_server
port = 1982005248.0;tcp://192.168.1.219:38916+1982005249.0;tcp://192.168.1.219:41605:300
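
For anyone reading the output above: the printed port string looks like two contact URIs joined by `+`, followed by a `:`-separated tag. The sketch below splits it under that assumption. The layout is inferred from the printed value only, not from a documented API, and `Endpoint`, `PortName`, `parse_uri`, and `parse_port_name` are hypothetical names introduced for illustration. It is only meant to show which host and TCP ports the client side has to be able to reach.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// One "<jobid>.<vpid>;tcp://<host>:<port>" entry (assumed layout).
struct Endpoint {
    unsigned long jobid;
    unsigned long vpid;
    std::string host;
    int tcp_port;
};

// A port string of the assumed form "<uri>+<uri>:<tag>".
struct PortName {
    Endpoint first;
    Endpoint second;
    int tag;
};

// Parse a single URI such as "1968111616.0;tcp://192.168.1.219:55602".
// Assumes exactly one tcp:// address per entry, as in the output above.
Endpoint parse_uri(const std::string& uri) {
    const std::string scheme = ";tcp://";
    const std::size_t dot = uri.find('.');
    const std::size_t addr = uri.find(scheme);
    const std::size_t colon = uri.rfind(':');
    if (dot == std::string::npos || addr == std::string::npos || dot > addr ||
        colon == std::string::npos || colon < addr + scheme.size()) {
        throw std::invalid_argument("unexpected URI layout: " + uri);
    }
    const std::size_t host_begin = addr + scheme.size();
    Endpoint ep{};
    ep.jobid = std::stoul(uri.substr(0, dot));
    ep.vpid = std::stoul(uri.substr(dot + 1, addr - dot - 1));
    ep.host = uri.substr(host_begin, colon - host_begin);
    ep.tcp_port = std::stoi(uri.substr(colon + 1));
    return ep;
}

// Split the whole port string at '+' and at the last ':' (the tag).
PortName parse_port_name(const std::string& port) {
    const std::size_t plus = port.find('+');
    const std::size_t tagsep = port.rfind(':');
    if (plus == std::string::npos || tagsep == std::string::npos || tagsep < plus) {
        throw std::invalid_argument("unexpected port layout: " + port);
    }
    PortName pn{};
    pn.first = parse_uri(port.substr(0, plus));
    pn.second = parse_uri(port.substr(plus + 1, tagsep - plus - 1));
    pn.tag = std::stoi(port.substr(tagsep + 1));
    return pn;
}
```

Applied to the port string above, this gives two endpoints on 192.168.1.219, at TCP ports 38916 and 41605, with tag 300. If the two nodes are separated by a firewall, those are the ports worth checking in addition to the ompi-server port 55602.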

Here is the relevant code that the server runs.

try {
    MPI::Open_port(MPI::INFO_NULL, port);
} catch (MPI::Exception e) {
    fprintf(stderr, "Server open port error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Finalize();
    return EXIT_FAILURE;
}

MPI::Info info = MPI::Info::Create();
info.Set("ompi_global_scope", "true");

try {
    MPI::Publish_name("test_service", info, port);
} catch (MPI::Exception e) {
    fprintf(stderr, "Server service publish error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    info.Free();
    MPI::Close_port(port);
    MPI::Finalize();
    return EXIT_FAILURE;
}

info.Free();

printf("port = %s\n", port);

try {
    intercomm = MPI::COMM_SELF.Accept(port, MPI::INFO_NULL, 0);
} catch (MPI::Exception e) {
    fprintf(stderr, "Server accept error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Unpublish_name("test_service", MPI::INFO_NULL, port);
    MPI::Close_port(port);
    MPI::Finalize();
    return EXIT_FAILURE;
}

On another node, I run the client and get an error.

mpirun -np 1 --hostfile ~/mpi-hosts --ompi-server "1968111616.0;tcp://192.168.1.219:55602" /home/barronj/ompi_test/port_client
barronj@kurenai's password:
Client found test_service on port, 1982005248.0;tcp://192.168.1.219:38916+1982005249.0;tcp://192.168.1.219:41605:300
[athena:07039] [[28058,0],0]-[[30243,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 19

As you can see, the service and its port are found, but connecting produces the error. Here is the relevant client code.

try {
    MPI::Lookup_name("test_service", MPI::INFO_NULL, port);
} catch (MPI::Exception e) {
    fprintf(stderr, "Service lookup error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Finalize();
    return EXIT_FAILURE;
}

printf("Client found test_service on port, %s\n", port);

try {
    intercomm = MPI::COMM_SELF.Connect(port, MPI::INFO_NULL, 0);
} catch (MPI::Exception e) {
    fprintf(stderr, "Client connect error (%d): %s\n", e.Get_error_code(), e.Get_error_string());
    MPI::Finalize();
    return EXIT_FAILURE;
}
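
To relate the bracketed names in the error line to the numeric ids in the URIs, here is a small sketch. It assumes the upper 16 bits of the numeric job id are the "job family" and the lower 16 bits are a local job id. That split is an assumption about Open MPI's internal naming, although it does match the numbers in this post. `JobName`, `split_jobid`, and `format_name` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Job id split into its two assumed 16-bit halves.
struct JobName {
    std::uint32_t family;  // upper 16 bits (assumed: job family)
    std::uint32_t local;   // lower 16 bits (assumed: local job id)
};

// Split a numeric job id, e.g. 1968111616 -> {30031, 0}.
JobName split_jobid(std::uint32_t jobid) {
    return JobName{jobid >> 16, jobid & 0xFFFFu};
}

// Render "[[family,local],vpid]" in the style of the log lines.
std::string format_name(std::uint32_t jobid, std::uint32_t vpid) {
    const JobName j = split_jobid(jobid);
    return "[[" + std::to_string(j.family) + "," + std::to_string(j.local) +
           "]," + std::to_string(vpid) + "]";
}
```

With this, `format_name(1968111616u, 0)` gives `[[30031,0],0]`, the name ompi-server prints for itself, and `format_name(1982005248u, 0)` gives `[[30243,0],0]`, the second name in the error line. If the assumption holds, the failing socket is between the client's mpirun, `[[28058,0],0]`, and the endpoint at port 38916 from the first half of the published port string, not the ompi-server itself.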

Since I'm new to this, I don't fully understand all of it yet. I tried using MPI::COMM_WORLD instead of MPI::COMM_SELF; that did not fix it.

I'm not sure whether this is relevant, but I also tried adding the --wait-for-server option.

mpirun -np 1 --hostfile ~/mpi-hosts --ompi-server "1968111616.0;tcp://192.168.1.219:55602" --wait-for-server /home/barronj/ompi_test/port_client
--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:

Server uri:  1968111616.0;tcp://192.168.1.219:55602
Timeout time: 10

Error received: Not supported

Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------

Adding this option to the server's mpirun command does the same thing.

I also tried giving --ompi-server a file instead of copying and pasting the URI. That just produces the same problem.

Any help is appreciated. Thanks.

0 answers:

No answers