Running MPI on multiple computers

Asked: 2015-10-12 12:53:21

Tags: ssh mpi distributed-computing mpich

I can run my MPI program with any number of processes on a single computer, but not on multiple computers. I have a "machines" file that specifies the number of processes per host, like this:

localhost:6
another_host:4

Below are three examples:

// When I run the program on only localhost, everything is OK.
mpirun -n 10 ./myMpiProg parameter1 parameter2

// In this case, everything is OK, too.
mpirun -f machinesFile -n 10 ./myMpiProg parameter1 parameter2

// This is also OK
mpirun -n 8 ./myMpiProg parameter1 parameter2
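
For reference, how the ranks actually get placed on localhost and another_host can be checked with a minimal sketch like the one below (it assumes Boost.MPI is available; myMpiProg itself is not shown here), launched with the same mpirun commands. It only prints each rank's host name:

// whereami.cpp -- diagnostic sketch, not part of the original program.
// Build against Boost.MPI and run e.g.: mpirun -f machinesFile -n 10 ./whereami
#include <boost/mpi.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;

    // processor_name() reports the host this rank was placed on, so the
    // localhost:6 / another_host:4 split from the machines file can be verified.
    std::cout << "rank " << world.rank() << " of " << world.size()
              << " on " << boost::mpi::environment::processor_name()
              << std::endl;
    return 0;
}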

When I change the machines file to the following:

localhost:6
another_host:2

...

// But this does not work.
mpirun -f machinesFile -n 8 ./myMpiProg parameter1 parameter2

When I run the program in the distributed environment, the error below occurs. What is more interesting, it only happens for certain process counts, for example 8 or 12 processes; it never happens with 10 processes.

terminate called after throwing an instance of 'std::length_error' what():  vector::reserve

So, is there any difference between running an MPI program on one machine and running it on multiple machines?

1 Answer:

Answer 0 (score: 0):

I found the problem by accident, but I still do not know why it happens. When I save the isend requests in a vector, everything works fine; if I do not save them, the error appears. Sometimes it is std::length_error, sometimes a different one.

The code I am referring to can be found at https://stackoverflow.com/a/33375285/2979477. If I change this line:

mpiSendRequest.push_back(world.isend(neighbors[j], 100, *p));

to:

world.isend(neighbors[j], 100, *p);

the error occurs. It does not make sense to me, but maybe there is a reasonable explanation.
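
For completeness, here is a sketch of the pattern that works, with the requests kept in a vector and waited on before they go out of scope. The helper function and its template parameter are mine; neighbors, the tag 100 and the value being sent correspond to the snippet above:

#include <boost/mpi.hpp>
#include <vector>

// Hypothetical helper mirroring the working version of the snippet above:
// it sends `value` (standing in for *p) to every neighbor and keeps the
// requests so they can be waited on.
template <typename T>
void send_to_neighbors(boost::mpi::communicator& world,
                       const std::vector<int>& neighbors,
                       const T& value)
{
    std::vector<boost::mpi::request> mpiSendRequest;
    mpiSendRequest.reserve(neighbors.size());

    for (std::size_t j = 0; j < neighbors.size(); ++j) {
        // isend() is non-blocking: the returned request has to stay alive
        // until the transfer has completed.
        mpiSendRequest.push_back(world.isend(neighbors[j], 100, value));
    }

    // Wait for every outstanding send before the requests go out of scope.
    boost::mpi::wait_all(mpiSendRequest.begin(), mpiSendRequest.end());
}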

Error message:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
what():  MPI_Alloc_mem: Unable to allocate memory for MPI_Alloc_mem, error stack:
MPI_Alloc_mem(115): MPI_Alloc_mem(size=1600614252, MPI_INFO_NULL, baseptr=0x7fffbb499e90) failed
MPI_Alloc_mem(96).: Unable to allocate memory for MPI_Alloc_mem
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
what():  MPI_Alloc_mem: Unable to allocate memory for MPI_Alloc_mem, error stack:
MPI_Alloc_mem(115): MPI_Alloc_mem(size=1699946540, MPI_INFO_NULL, baseptr=0x7fffdad0ee10) failed
MPI_Alloc_mem(96).: Unable to allocate memory for MPI_Alloc_mem
[proxy:0:1@mpi_notebook] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:1@mpi_notebook] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@mpi_notebook] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)