Question

I have configured two hosts with Open MPI (ompi), and I can run the following sample code successfully on each host on its own.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int numtasks, rank, dest, source, rc, count, tag = 1;
  char inmsg, outmsg = 'x';
  MPI_Status Stat;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Rank 0 sends first, then waits for the reply from rank 1. */
  if (rank == 0) {
    dest = 1;
    source = 1;
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
  }
  /* Rank 1 receives first, then replies to rank 0. */
  else if (rank == 1) {
    dest = 0;
    source = 0;
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  }

  /* Only ranks 0 and 1 received a message, so only they have a valid Stat. */
  if (rank < 2) {
    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d \n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
  }

  MPI_Finalize();
  return 0;
}

mpirun -np 2 sendReceive.o

works fine.

mpirun -np 2 --host host1,host1 sendReceive.o

[ip-172-31-71-xx:11221] [[55975,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 398
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[55975,0],0] on node ip-172-31-78-xx
  Remote daemon: [[55975,0],1] on node ip-172-31-71-xx

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

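I verified that I can ssh between the hosts and that they are configured correctly. I am unable to narrow the problem down from here. Any suggestions?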

Answer: I had mistakenly installed a different version of MPI on each system. Once I made the versions match, it worked!
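A version mismatch like this can be spotted before launching by asking each host which version it reports. A minimal sketch, assuming passwordless ssh (which the question says already works), that `mpirun` is on the non-interactive PATH of every host, and hypothetical host names `host1` and `host2`:

```shell
# Hypothetical host names; replace with the hosts passed to --host.
HOSTS="host1 host2"

# Print the first line of `mpirun --version` as reported on each host.
# Every line should show the same Open MPI version.
check_mpi_versions() {
  for h in $HOSTS; do
    printf '%s: ' "$h"
    ssh "$h" 'mpirun --version 2>&1 | head -n 1'
  done
}
```

Calling `check_mpi_versions` prints one line per host. If the versions differ, the daemons on the two nodes may not understand each other's messages, which is consistent with the "Data unpack would read past end of buffer" error above.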

Answer 1

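You must allow MPI traffic between the hosts in the security group. You can do this by first restricting MPI communication to a specific port range and then allowing that port range in the security group as a Custom TCP rule. After that it should work as expected. To restrict the port range, see openmpi-mca-params.conf. According to that configuration file: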

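By default, two files are searched (in order):

$HOME/.openmpi/mca-params.conf: the user-supplied set of values takes the highest precedence.

$prefix/etc/openmpi-mca-params.conf: the system-supplied set of values has a lower precedence.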
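As a rough illustration of restricting the ports, the sketch below appends a few settings to the per-user file. The port numbers are arbitrary assumptions, and the parameter names are the TCP-related ones from Open MPI releases that still use ORTE (up to 4.x). Confirm them for your installation with `ompi_info --param btl tcp --level 9` and `ompi_info --param oob tcp --level 9` before relying on them.

```shell
# Append port restrictions to a per-user MCA parameter file.
# The first argument is the directory to write to (default: $HOME/.openmpi).
# Port numbers are assumptions; use any free range and open the same
# range in the security group.
write_mca_port_params() {
  conf_dir="${1:-$HOME/.openmpi}"
  mkdir -p "$conf_dir"
  cat >> "$conf_dir/mca-params.conf" <<'EOF'
# MPI point-to-point traffic over TCP: 100 ports starting at 10000
btl_tcp_port_min_v4 = 10000
btl_tcp_port_range_v4 = 100
# Out-of-band traffic between mpirun and the remote daemons
oob_tcp_dynamic_ipv4_ports = 10100-10199
EOF
}
```

Run `write_mca_port_params` on every host so that all nodes use the same range. The same values can also be passed per run with `mpirun --mca <name> <value>`.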

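To allow the custom TCP port range in the security group:

Go to the EC2 Management Console.
Go to Security Groups.
Select the relevant security group and, under the inbound rules, click Edit.
Add the port range you chose earlier.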
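The console steps above can also be done from the command line. This is only a sketch, assuming the AWS CLI is installed and configured with credentials. The security group ID, port range, and source address range below are hypothetical placeholders.

```shell
# Hypothetical values; replace with your own group, ports, and source range.
SG_ID="sg-0123456789abcdef0"   # security group attached to both instances
PORT_RANGE="10000-10199"       # the range chosen for MPI traffic
SOURCE_CIDR="172.31.0.0/16"    # private address range the instances use

# Add one inbound Custom TCP rule covering the MPI port range.
allow_mpi_ports() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" \
    --protocol tcp \
    --port "$PORT_RANGE" \
    --cidr "$SOURCE_CIDR"
}
```

Call `allow_mpi_ports` once per security group. Limiting the source to the private range of the instances, rather than `0.0.0.0/0`, keeps the MPI ports closed to the public internet.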

CLOSED OpenMPI: mpirun error on multiple hosts
