我刚开始学习有关mpi的内容,所以我买了3个vps来创建一个实验环境。我成功安装并配置了ssh和mpich。三个节点可以在没有密码的情况下互相ssh(但不是自己)。并且cpi示例在本地计算机上没有任何ptoblem传递。当我试图在所有3个节点上运行它时,cpi程序总是存在错误
Fatal error in PMPI_Reduce: Unknown error class, error stack:
。
以下是我所做的以及错误所说的完整描述。
[root@fire examples]# mpiexec -f ~/mpi/machinefile -n 6 ./cpi
Process 3 of 6 is on mpi0
Process 0 of 6 is on mpi0
Process 1 of 6 is on mpi1
Process 2 of 6 is on mpi2
Process 4 of 6 is on mpi1
Process 5 of 6 is on mpi2
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1263)...............: MPI_Reduce(sbuf=0x7fff1c18c440, rbuf=0x7fff1c18c448, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(826)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 1
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 2
MPIR_Reduce_intra(846)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(250).......: Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1563 RUNNING AT mpi0
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@mpi2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@mpi2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@mpi2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@mpi1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@mpi1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@mpi1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@mpi0] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@mpi0] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@mpi0] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@mpi0] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
我不知道发生了什么,一些见解? 正如评论所示,这是mpi cpi代码。
#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(double);
double f(double a)
{
return (4.0 / (1.0 + a*a));
}
int main(int argc,char *argv[])
{
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);
fprintf(stdout,"Process %d of %d is on %s\n",
myid, numprocs, processor_name);
fflush(stdout);
n = 10000; /* default # of rectangles */
if (myid == 0)
startwtime = MPI_Wtime();
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
/* A slightly better approach starts from large i and works back */
for (i = myid + 1; i <= n; i += numprocs)
{
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
endwtime = MPI_Wtime();
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
printf("wall clock time = %f\n", endwtime-startwtime);
fflush(stdout);
}
MPI_Finalize();
return 0;
}
答案 0 :(得分:4)
它可能为时已晚,无论如何我会提供我的答案,我遇到了同样的问题,经过一些研究我发现了问题
如果你有一个带有主机名而不是ip-addresses的机器文件,并且让机器在本地连接,那么你应该在本地运行一个名称服务器,或者将机器文件中的条目更改为ip-address而不是主机名。只有/ etc / hosts才能解决问题
这似乎是我的问题,一旦我将机器文件中的entires更改为ip-addresses就行了
此致 GOPI
答案 1 :(得分:0)
我的四个Raspberry Pi集群(模型B)有同样的问题。
我设置了我的RASPBIAN版本,为我的防火墙使用“ufw”,并设置“ssh”以使用带有“密码”的RSA密钥为每个Raspberry Pi。直到我将每个pi的公钥(参见ssh-copy-id)分发给每个其他pi,我才能通过上述错误消息。
请注意,运行ssh-agent然后在运行“mpiexec”之前在每个Raspberry Pi上运行ssh-add有点乏味(我还是要了解pssh / parallel-ssh是否可以帮助设置)。