Open MPI - ring_c fails on multiple hosts

Time: 2016-04-03 21:58:35

Tags: testing installation openmpi

I recently installed Open MPI on two Ubuntu 14.04 hosts and am now testing its functionality with the two provided test programs hello_c and ring_c. The hosts are called 'hermes' and 'zeus', and both have a user 'mpiuser' that can log in non-interactively (via ssh-agent).

Both mpirun hello_c and mpirun --host hermes,zeus hello_c work as expected.

Calling mpirun --host zeus ring_c locally also works, on both hermes and zeus. The output:

mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host zeus ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting

But calling mpirun --host hermes,zeus ring_c fails with the following output:

mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host hermes,zeus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[zeus:2930] *** An error occurred in MPI_Recv
[zeus:2930] *** on communicator MPI_COMM_WORLD
[zeus:2930] *** MPI_ERR_TRUNCATE: message truncated
[zeus:2930] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Process 0 sent to 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 2930 on
node zeus exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

I haven't found any documentation on how to solve this problem, and the error output doesn't tell me where to start looking. How can I fix this?

1 Answer:

Answer 0 (score: 0)

You changed two things between the first and second runs: you increased the number of processes from 1 to 2, and you moved from a single host to multiple hosts.

I would suggest you first check whether you can run 2 processes on the same host:

mpirun -n 2 ring_c

and see what you get.

When debugging on a cluster it is often useful to know where each process is actually running, and you should always print out the total number of processes as well. Try putting the following code near the top of ring_c.c:

char nodename[MPI_MAX_PROCESSOR_NAME];
int namelen;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank and size are the ints already declared in ring_c.c */
MPI_Comm_size(MPI_COMM_WORLD, &size);

/* report which node this rank actually landed on */
MPI_Get_processor_name(nodename, &namelen);
printf("Rank %d out of %d running on node %s\n", rank, size, nodename);

The error you are getting says that an incoming message was too large for the receive buffer, which is strange because the code only ever sends and receives a single integer.
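
For reference, here is a minimal sketch of the ring pattern that ring_c implements (a simplified reconstruction, not the exact example source): every transfer is exactly one MPI_INT, which is why a truncation error is surprising here.

/* Simplified reconstruction of the ring_c pattern.
 * A single int is passed around the ring; rank 0 injects the
 * value 10 and decrements it on each lap until it reaches 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, message, tag = 201;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    if (rank == 0) {
        message = 10;                       /* rank 0 starts the token */
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (1) {
        /* receive exactly one int from the previous rank */
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0) {
            message--;                      /* only rank 0 decrements */
            printf("Process 0 decremented value: %d\n", message);
        }
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (message == 0)
            break;                          /* token has reached zero */
    }

    /* rank 0 still has one in-flight message to drain from the ring */
    if (rank == 0)
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}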