Question

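I am trying to parallelize a biological model in C++ with boost::mpi. This is my first attempt, and I am entirely new to the Boost library (I started from the book The Boost C++ Libraries by Schäling). The model consists of grid cells and cohorts of individuals living within each grid cell. The classes are nested, so a vector of Cohort* belongs to a GridCell. The model runs for 1000 years, and at each time step there is dispersal, so the cohorts of individuals move randomly between grid cells. I want to parallelize the content of the for loop, not the loop itself, because each time step depends on the state at the previous time step.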

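I use world.send() and world.recv() to send the necessary information from one rank to another. Because there is sometimes nothing to send between two ranks, I use mpi::status and world.iprobe() to make sure the code does not hang waiting for a message that was never sent (I followed this tutorial).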

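The first part of my code seems to work fine, but I am having trouble making sure that all sent messages have been received before proceeding to the next step of the for loop. In fact, I noticed that some ranks move on to the next time step before other ranks have had time to send their messages (or at least that is what it looks like from the output).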

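I am not posting the full code because it consists of several classes and is quite long. If you are interested, the code is on GitHub. Here I give rough pseudocode, which I hope is enough to understand the problem.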

int main()
{
    // initialise the GridCells and Cohorts living in them

    //depending on the number of cores requested split the 
    //grid cells that are processed by each core evenly, and 
    //store the relevant grid cells in a vector of  GridCell*

    // start to loop through each time step
    for (int k = 0; k < (burnIn+simTime); k++) 
    {
        // calculate the survival and reproduction probabilities 
        // for each Cohort and the dispersal probability

        // the dispersing Cohorts are sorted based on the rank of
        // the destination and stored in multiple vector<Cohort*>

        // I send the vector<Cohort*> with 
        world.send(…)

        // the receiving rank gets the vector of Cohorts with: 
        mpi::status statuses[world.size()];
        for(int st = 0; st < world.size(); st++)
        {
            ....
            if (world.iprobe(st, tagrec))
                statuses[st] = world.recv(st, tagrec, toreceive[st]);
            //world.iprobe ensures that the code doesn't hang when there
            // are no dispersers
        }
        // do some extra calculations here

        //wait that all processes are received, and then the time step ends. 
        //This is the bit where I am stuck. 
        //I've seen examples with wait_all for the non-blocking isend/irecv,
        // but I don't think it is applicable in my case.
        //The problem is that I noticed that some ranks proceed to the next
        //time step before all the other ranks have sent their messages.
    }
}

I compile with

mpic++ -I/$HOME/boost_1_61_0/boost/mpi -std=c++11 -Llibdir -lboost_mpi -lboost_serialization -lboost_locale -o out

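and run it with mpirun -np 5 out, but I would like to run it later on an HPC cluster with a larger number of cores (the model will be run at the global scale, and the number of cells may depend on the grid cell size chosen by the user). The installed compiler is g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, and the Open MPI version is 2.1.1.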

Answer 1

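The fact that you have nothing to send is an important piece of information in your scenario. The receiver cannot deduce it from the mere absence of a message: when iprobe finds nothing, that only means that nothing has been sent yet, not that nothing will be sent.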

The simplest way out is to always send the vector, even when it has zero size, and to skip the probing entirely.
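A minimal sketch of that idea, under stated assumptions: Cohort here is a simplified stand-in for the model's class, and Disperser, make_outboxes, exchange and the WITH_BOOST_MPI macro are hypothetical names introduced only for this illustration. Every rank builds exactly one outbox per rank, leaving it empty when there are no dispersers, so every receiver knows in advance how many messages to expect and never needs to probe. The exchange itself is shown with boost::mpi::all_to_all, which is one possible way to send one vector to every rank. It is not necessarily what the asker's code should use.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for the model's Cohort class.
struct Cohort {
    int id;
    double abundance;

    // Used by Boost.Serialization when the vectors are sent; never
    // instantiated (and therefore harmless) when Boost is not used.
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & id;
        ar & abundance;
    }
};

// Hypothetical helper type: a cohort leaving this rank, together with
// the rank that owns its destination grid cell.
struct Disperser {
    int dest_rank;
    Cohort cohort;
};

// Build exactly one outbox per rank. Ranks that receive no dispersers
// still get an outbox; it is simply empty.
std::vector<std::vector<Cohort>> make_outboxes(
    const std::vector<Disperser>& dispersers, int world_size)
{
    std::vector<std::vector<Cohort>> outboxes(world_size);
    for (const Disperser& d : dispersers) {
        assert(d.dest_rank >= 0 && d.dest_rank < world_size);
        outboxes[d.dest_rank].push_back(d.cohort);
    }
    return outboxes;
}

#ifdef WITH_BOOST_MPI  // opt-in macro assumed for this sketch
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>

// One exchange per time step: outboxes[r] goes to rank r, and the
// result holds at index r the (possibly empty) vector sent by rank r.
std::vector<std::vector<Cohort>> exchange(
    const boost::mpi::communicator& world,
    const std::vector<std::vector<Cohort>>& outboxes)
{
    std::vector<std::vector<Cohort>> inboxes;
    boost::mpi::all_to_all(world, outboxes, inboxes);
    return inboxes;
}
#endif
```

Inside the time loop, each rank would call make_outboxes(dispersers, world.size()) and then exchange(world, outboxes). A rank returns from the exchange only after it has received the vector from every other rank, so it cannot start the next time step with cohorts still missing. The same effect can be had with plain world.send() and world.recv() as long as every pair of ranks exchanges a message on every step, although blocking sends in both directions may need non-blocking isend to avoid deadlock.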

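Otherwise, you would probably have to change your approach drastically or implement a very complex speculative execution and rollback mechanism.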

Also note that the linked tutorial uses probe in a very different way.
