MPI (OpenMPI) - MPI_Publish_name cannot contact the global ompi-server and throws an error

Time: 2014-05-03 23:18:14

Tags: mpi openmpi

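I am trying to write an MPI application made up of programs arranged as server and client modules. I want the server to publish its name to a global-scope ompi-server.

Here is the server code: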

#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs, errmpi;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    // Open a port and ask for the name to be published in the global scope
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "yes");
    MPI_Open_port(info, port_name);

    // Fails here
    MPI_Publish_name("ServerName", info, port_name);

    // Rest of code...

Running it produces the following error:

$ ./mpi/bin/mpirun -np 1 --mca btl self ServerName
--------------------------------------------------------------------------
Process rank 0 attempted to publish to a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.

--------------------------------------------------------------------------
[xxx:18205] *** An error occurred in MPI_Publish_name
[xxx:18205] *** reported by process [1424949249,139676631433216]
[xxx:18205] *** on communicator MPI_COMM_WORLD
[xxx:18205] *** MPI_ERR_INTERN: internal error
[xxx:18205] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[xxx:18205] ***    and potentially your MPI job)

I do have an ompi-server process running in debug mode on the console:

$ ./ompi-server --no-daemonize -d -r +
[xxx:14140] [[9416,0],0] orte-server: up and running!

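Eventually I will distribute the processes across several nodes, but for now I just want to get the framework running on a single node. Can anyone help? Thank you very much!

Edit 1: Thank you very much for the quick replies. I made the following change: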

$mpi/bin/ompi-server --no-daemonize -d -r mpiuri

If I run the program now, it hangs at the point where it previously failed:

$./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v Server

If I run the program with the following instead,

$ ./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v --wait-for-server --server-wait-time 10 Server

I get the following error:

--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:

Server uri:  799801344.0;tcp://192.168.1.113:44487
Timeout time: 10

Error received: Not supported

Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------

I must be close... but I cannot figure it out.

I am fairly sure it is not the firewall, because I added the rule ALLOW 192.168.1.0/24 to ufw.

1 answer:

Answer 0 (score: 1)

Here is how to connect to the ompi-server:

1) Make sure the ompi-server is up and running, and have it write its URI to a file with the following command:

$mpi/bin/ompi-server --no-daemonize -d -r mpiuri

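2) Start all MPI processes with this URI file, making sure that you

  1. put the "file:" prefix in front of the URI file name when passing the --ompi-server argument
  2. give the hostname of the node on which mpirun runs, like this:

    $ ./mpi/bin/mpirun --ompi-server file:mpiuri -host myHostName -np 1 -v Server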
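
Both points can be checked in code before anything is launched. The following is only a minimal C sketch and is not part of Open MPI: `read_server_uri` and `build_mpirun_command` are hypothetical helper names. The test for a ';' in the URI is an assumption based on the URI printed in the error output in the question (`799801344.0;tcp://192.168.1.113:44487`).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI API): read the first line of the
 * URI file written by `ompi-server -r <file>` into `uri`.
 * Returns 0 on success, -1 if the file is missing, empty, or has no ';'.
 * The ';' check is an assumption taken from the URI shown in the
 * question's error output. */
int read_server_uri(const char *path, char *uri, size_t size)
{
    FILE *fp;

    if (size < 2)
        return -1;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(uri, (int)size, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    uri[strcspn(uri, "\r\n")] = '\0';   /* strip the trailing newline */
    return (uri[0] != '\0' && strchr(uri, ';') != NULL) ? 0 : -1;
}

/* Hypothetical helper: assemble the mpirun command line from the answer,
 * with the "file:" prefix on the URI file and an explicit -host.
 * Returns 0 on success, -1 if `buf` is too small. */
int build_mpirun_command(char *buf, size_t size, const char *mpirun,
                         const char *uri_file, const char *host,
                         int nprocs, const char *program)
{
    int n = snprintf(buf, size,
                     "%s --ompi-server file:%s -host %s -np %d -v %s",
                     mpirun, uri_file, host, nprocs, program);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Calling `read_server_uri("mpiuri", ...)` first catches the case where ompi-server has not written its URI file yet. `build_mpirun_command` then yields a string that can be printed or passed to `system()`.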