mpirun crashes with simple code

Asked: 2013-02-13 18:43:16

Tags: boost mpi openmpi boost-mpi

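I am trying to generate code that uses MPI to behave roughly like a program with dependencies. If I use a number of processors (e.g. mpirun -np X) where X is greater than the number of tasks I am trying to model (i.e. the number of cases in my switch statement), everything works fine. My program model has a list of tasks, an execution time for each task, and a set of dependencies between tasks. I have generated MPI code similar to the following (the real program would contain between 50 and 600 tasks, i.e. cases):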

#include <boost/mpi.hpp>
#include <ctime>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[]) {
  mpi::environment env(argc, argv);
  mpi::communicator world;
  long execution_times[4] = {9, 4, 3, 6};  // seconds per task

  switch (world.rank()) {
    case 1: {
      std::cout << "1: Awake" << std::endl;
      // Wait for the completion notice from predecessor task 0.
      mpi::request req[1];
      req[0] = world.irecv(0, 0);
      mpi::wait_all(req, req + 1);
      std::cout << "1: Recv notice from pred 0" << std::endl;
      // Busy-wait to simulate this task's execution time.
      time_t start = time(NULL);
      std::cout << "1: Started compute" << std::endl;
      while ((time(NULL) - start) < execution_times[1]);
      std::cout << "1: Finished compute in " << (time(NULL) - start) << std::endl;
      // Notify successor tasks 5, 23 and 42.
      mpi::request sreq[3];
      sreq[0] = world.isend(5, 0);
      sreq[1] = world.isend(23, 0);
      sreq[2] = world.isend(42, 0);
      mpi::wait_all(sreq, sreq + 3);
      std::cout << "1: Sent notice to succ 5" << std::endl;
      std::cout << "1: Sent notice to succ 23" << std::endl;
      std::cout << "1: Sent notice to succ 42" << std::endl;
      break;
    }
    // Other cases excluded for brevity...
  }
  return 0;
}
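
For reference, the task model described above (a list of tasks, an execution time per task, and dependencies between tasks) could be represented roughly as follows. This is only a sketch: the Task struct, its field names, and build_tasks are hypothetical and do not come from the post. The edges in the usage note (0 to 1 and 0 to 2) are the ones implied by the program output.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of one task; none of these names come from the post.
struct Task {
  long execution_time;            // seconds of simulated compute
  std::vector<int> predecessors;  // tasks that must notify this one first
  std::vector<int> successors;    // tasks this one notifies when it finishes
};

// Build the task list from per-task times and (from, to) dependency edges.
std::vector<Task> build_tasks(const std::vector<long>& times,
                              const std::vector<std::pair<int, int>>& edges) {
  std::vector<Task> tasks(times.size());
  for (std::size_t i = 0; i < times.size(); ++i) {
    tasks[i].execution_time = times[i];
  }
  for (const auto& edge : edges) {
    // Both endpoints must name a task that exists in the list.
    assert(edge.first >= 0 &&
           static_cast<std::size_t>(edge.first) < tasks.size());
    assert(edge.second >= 0 &&
           static_cast<std::size_t>(edge.second) < tasks.size());
    tasks[edge.first].successors.push_back(edge.second);
    tasks[edge.second].predecessors.push_back(edge.first);
  }
  return tasks;
}
```

Calling build_tasks({9, 4, 3, 6}, {{0, 1}, {0, 2}}) would give four tasks in which task 0 lists 1 and 2 as successors. In generated code of the kind shown above, each predecessor would correspond to one irecv and each successor to one isend.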

I can compile this with g++ -L/usr/local/lib -lmpi -lmpi_cxx -lboost_serialization -lboost_mpi test.cpp and then run it with mpirun -np 4 a.out

However, once I get to cases beyond the number of processors, I always get an exception, e.g.

hamiltont$ mpirun -np 2 a.out 
0: Awake
0: Started compute
0: Finished compute in 0
1: Awake
1: Recv notice from pred 0
1: Started compute
libc++abi.dylib: terminate called throwing an exception
hamiltont$ mpirun -np 3 a.out 
0: Awake
0: Started compute
0: Finished compute in 0
1: Awake
1: Recv notice from pred 0
1: Started compute
2: Awake
2: Recv notice from pred 0
2: Started compute
libc++abi.dylib: terminate called throwing an exception

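Note that increasing the number of processors from 2 to 3 lets one more case execute successfully. I think there is something about MPI that I do not understand.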

The full exception:

libc++abi.dylib: terminate called throwing an exception
[MacBook-Pro:47495] *** Process received signal ***
[MacBook-Pro:47495] Signal: Abort trap: 6 (6)
[MacBook-Pro:47495] Signal code:  (0)
[MacBook-Pro:47495] [ 0] 2   libsystem_c.dylib                   0x00007fff91e9b8ea _sigtramp + 26
[MacBook-Pro:47495] [ 1] 3   ???                                 0x0000000000000000 0x0 + 0
[MacBook-Pro:47495] [ 2] 4   libc++abi.dylib                     0x00007fff8f29ca17 abort_message + 257
[MacBook-Pro:47495] [ 3] 5   libc++abi.dylib                     0x00007fff8f29a3c6 _ZL17default_terminatev + 28
[MacBook-Pro:47495] [ 4] 6   libobjc.A.dylib                     0x00007fff94857887 _ZL15_objc_terminatev + 111
[MacBook-Pro:47495] [ 5] 7   libc++abi.dylib                     0x00007fff8f29a3f5 _ZL19safe_handler_callerPFvvE + 8
[MacBook-Pro:47495] [ 6] 8   libc++abi.dylib                     0x00007fff8f29a450 __cxa_bad_typeid + 0
[MacBook-Pro:47495] [ 7] 9   libc++abi.dylib                     0x00007fff8f29b5b7 _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[MacBook-Pro:47495] [ 8] 10  a.out                               0x00000001086a818e _ZN5boost15throw_exceptionINS_3mpi9exceptionEEEvRKT_ + 158
[MacBook-Pro:47495] [ 9] 11  libboost_mpi.dylib                  0x0000000108a061e7 _ZNK5boost3mpi12communicator5isendEii + 111
[MacBook-Pro:47495] [10] 12  a.out                               0x0000000108676fc9 main + 1257
[MacBook-Pro:47495] [11] 13  libdyld.dylib                       0x00007fff911837e1 start + 0
[MacBook-Pro:47495] [12] 14  ???                                 0x0000000000000001 0x0 + 1
[MacBook-Pro:47495] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 47495 on node MacBook-Pro.local exited on signal 6 (Abort trap: 6).
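
The backtrace shows a boost::mpi::exception being thrown from communicator::isend, called from main on rank 0. One plausible cause, which is an assumption and not something the post confirms, is that a send targets a rank that does not exist in the communicator, since the posted case sends to ranks 5, 23 and 42. A minimal sketch of a check for that condition follows. The helper name out_of_range_ranks is hypothetical, and world_size stands for the value that world.size() would return in the program above.

```cpp
#include <vector>

// Hypothetical helper: return every destination that is not a valid rank
// in a communicator with world_size processes (valid ranks are
// 0 .. world_size - 1).
std::vector<int> out_of_range_ranks(const std::vector<int>& destinations,
                                    int world_size) {
  std::vector<int> bad;
  for (int dest : destinations) {
    if (dest < 0 || dest >= world_size) {
      bad.push_back(dest);
    }
  }
  return bad;
}
```

In the posted program, case 1 could call this with {5, 23, 42} and world.size() before issuing its isend calls and print any ranks it returns. That would show whether the number of processes passed to mpirun -np is too small for the ranks the generated code addresses.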

0 answers:

No answers