Question

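I am using a 12-node Windows HPC cluster (24 cores per node) to run a C++ MPI program (using Boost MPI). I ran it once with the MPI reduce and once with the MPI reduce commented out (for a speed test only). The run times were 01:17:23 and 01:03:49. It seems to me that the MPI reduce takes a large share of the time. I think it might be worth trying to reduce at the node level first, and then reduce to the head node, to improve performance.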

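Below is a simple example for testing purposes. Suppose there are 4 compute nodes with 2 cores each. I want to first reduce within each node using MPI, and after that reduce to the head node. I am not very familiar with MPI, and the program below crashes.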

#include <iostream>
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;
using namespace std;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  int i = world.rank();


  boost::mpi::communicator local = world.split(world.rank()/2); // total 8 cores, divide in 4 groups
  boost::mpi::communicator heads = world.split(world.rank()%4);

  int res = 0;

  boost::mpi::reduce(local, i, res, std::plus<int>(), 0);
  if(world.rank()%2==0)
  cout<<res<<endl;
  boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);

  if(world.rank()==0)
      cout<<res<<endl;

  return 0;
}

The output is illegible, something like this:

Z
h
h
h
h
a
a
a
a
n
n
n
n
g
g
g
g
\
\
\
\
b
b
b
b
o
o
o
o
o
o
o
o
s
...
...
...

The error message is

Test.exe ended prematurely and may have crashed. exit code 3

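I suspect I am doing something wrong in the group split or in the reduce, but after several attempts I could not work it out. How should I change this to make it work? Thanks.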

Answer 1

The crash happens because you pass the same variable to MPI twice in the following line:

boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);

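This is not well documented in Boost.MPI, but Boost takes both arguments by reference and passes the respective pointers to MPI. MPI generally forbids passing the same buffer twice to the same call. To be precise, an output buffer passed to an MPI function must not alias (overlap) any other buffer passed in that call.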

You can easily fix this by creating a copy of res.
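A minimal sketch of that fix, based on the program from the question. The names `value`, `node_sum`, `total`, `is_leader` and `cores_per_node` are illustrative, not from the original post. The sketch also assumes the intent is for `heads` to contain the first rank of every node group, which is not what `world.rank()%4` produces, so it builds that communicator differently.

```cpp
#include <boost/mpi.hpp>
#include <functional>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;

    const int cores_per_node = 2;   // assumption: the 4 x 2 test layout from the question
    const int value = world.rank(); // each rank contributes its own rank number

    // Ranks 0-1, 2-3, 4-5 and 6-7 each form one node-level group.
    mpi::communicator local = world.split(world.rank() / cores_per_node);

    // Assumption: colour 0 gathers the first rank of every node group;
    // colour 1 gathers all other ranks, and that communicator is never used.
    const bool is_leader = (local.rank() == 0);
    mpi::communicator heads = world.split(is_leader ? 0 : 1);

    // Stage 1: node-level sum. node_sum is only meaningful on local rank 0.
    int node_sum = 0;
    mpi::reduce(local, value, node_sum, std::plus<int>(), 0);

    // Stage 2: the input (node_sum) and the output (total) are distinct
    // objects, so the two buffers handed to MPI never alias.
    if (is_leader) {
        int total = 0;
        mpi::reduce(heads, node_sum, total, std::plus<int>(), 0);
        if (heads.rank() == 0)
            std::cout << "total = " << total << std::endl;
    }
    return 0;
}
```

Launched with 8 processes, the node sums are 1, 5, 9 and 13, so the total at the root of `heads` should be 28. The program needs Boost.MPI and an MPI launcher to build and run.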

I also think you probably want to restrict the second reduce so that only processes with local.rank() == 0 call it.

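Also, to reiterate the comment: I doubt you will gain any benefit from reimplementing the reduction. Trying to optimize a performance problem whose bottleneck is not fully understood is generally a bad idea.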

C++ MPI: with multiple nodes, reduce at the node level first, then reduce to the head node
