Fatal error in MPI_Irecv: Aborting job

Date: 2011-06-02 22:21:03

Tags: c++ mpi

When I try to run my problem on four processors, I receive the sequence of errors below. The MPI command I use is mpirun -np 4

I apologize for posting the error messages (mainly because I lack the knowledge to decipher the information they provide). I would greatly appreciate your input on the following:

  1. What does the error message mean? At what point does one receive it? Is it due to system memory (hardware), or is it a communication error (related to MPI_Isend/Irecv, i.e., a software issue)?

  2. Finally, how do I fix this problem?

  3. Thank you!

    The error messages received are as follows: --- *Please note: this error is only received when the run time is long*. The code computes fine when the time needed to compute the data is short (i.e., 300 time steps compared to 1000 time steps).

    Aborting job:

    Fatal error in MPI_Irecv: Other MPI error, error stack:

    MPI_Irecv(143): MPI_Irecv(buf=0x8294a60, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd68c) failed

    MPID_Irecv(64): Out of memory

    Aborting job:

    Fatal error in MPI_Irecv: Other MPI error, error stack:

    MPI_Irecv(143): MPI_Irecv(buf=0x8295080, count=48, MPI_DOUBLE, src=3, tag=-1, MPI_COMM_WORLD, request=0xffffd690) failed

    MPID_Irecv(64): Out of memory

    Aborting job: Fatal error in MPI_Isend: Internal MPI error!, error stack:

    MPI_Isend(142): MPI_Isend(buf=0x8295208, count=48, MPI_DOUBLE, dest=3, tag=0, MPI_COMM_WORLD, request=0xffffd678) failed

    (unknown)(): Internal MPI error!

    Aborting job: Fatal error in MPI_Irecv: Other MPI error, error stack:

    MPI_Irecv(143): MPI_Irecv(buf=0x82959b0, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd678) failed

    MPID_Irecv(64): Out of memory

    rank 3 in job 1 myocyte80_37021 caused collective abort of all ranks   exit status of rank 3: return code 13

    rank 1 in job 1 myocyte80_37021 caused collective abort of all ranks   exit status of rank 1: return code 13

    Edit: (source code)

    Header files
    Variable declaration
    TOTAL TIME = 
    ...
    ...
    double *A = new double[Rows];
    double *AA = new double[Rows];
    double *B = new double[Rows];
    double *BB = new double[Rows];
    ....
    ....
    int Rmpi;
    int my_rank;
    int p;
    int source; 
    int dest;
    int tag = 0;
    function declaration
    
    int main (int argc, char *argv[])
    {
    MPI_Status status[8]; 
    MPI_Request request[8];
    MPI_Init (&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);   
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    
    //PROBLEM SPECIFIC PROPERTIES. VARY BASED ON NODE 
    if (Flag == 1)
    {
    if (my_rank == 0)
    {
    Defining boundary (start/stop) for special elements in tissue (Rows x Column)
    }
    if (my_rank == 2)
    ..
    if (my_rank == 3)
    ..
    if (my_rank == 4)
    ..
    }
    
    //INITIAL CONDITIONS ALSO VARY BASED ON NODE
    for (Columns = 0; Columns < 48; Columns++) // Normal Direction
    {
    for (Rows = 0; Rows < 48; Rows++)  //Transverse Direction
    {
    if (Flag == 1)
    {
    if (my_rank == 0)
    {
    Initial conditions for elements
    }
    if (my_rank == 1) //MPI
    {
    }
    ..
    ..
    ..
    //SIMULATION START
    
    while(t[0][0] < TOTAL TIME)
    {       
    for (Columns = 0; Columns < 48; Columns++) //Normal Direction
    {
    for (Rows = 0; Rows < 48; Rows++) //Transverse Direction
    {
    //SOME MORE PROPERTIES BASED ON NODE
    if (my_rank == 0)
    {
    if (FLAG == 1)
    {
    Condition 1
    }   
     else
    {
    Condition 2 
    }
    }
    
    if (my_rank == 1)
    ....
     ....
      ...
    
    //Evaluate functions (differential equations)
    Function 1 ();
    Function 2 ();
    ...
    ...
    
    //Based on output of differential equations, different nodes estimate variable values. Since
    //the problem is of nearest neighbor, corners and edges have different neighbors/boundary
    //conditions
    if (my_rank == 0)
    {
    If (Row/Column at bottom_left)                  
    {
    Variables =
    }
    
    if (Row/Column at Bottom Right) 
    {
    Variables =
    }
    }
    ...
     ...
    
     //Keeping track of time for each element in Row and Column. Time is updated for a certain element.
     t[Column][Row] = t[Column][Row]+dt;
    
      }
      }//END OF ROWS AND COLUMNS
    
     // MPI IMPLEMENTATION. AT END OF EVERY TIME STEP, Nodes communicate with nearest neighbor
     //First step is to populate arrays with values estimated above
     for (Columns = 0; Columns < 48; Columns++)
     {
     for (Rows = 0; Rows < 48; Rows++)
     {
     if (my_rank == 0)
     {
     //Loading the Edges of the (Row x Column) to variables. This one-dimensional array data
     //is shared with its nearest neighbor for computation at the next time step.
    
     if (Column == 47)
     {
     A[i] = V[Column][Row]; 
     …
     }
     if (Row == 47)
     {
     B[i] = V[Column][Row]; 
     }
     }
    
    ...
    ...                 
    
     //NON BLOCKING MPI SEND RECV TO SHARE DATA WITH NEAREST NEIGHBOR
    
     if ((my_rank) == 0)
     {
     MPI_Isend(A, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[1]);
     MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[3]);
     MPI_Wait(&request[3], &status[3]);  
     MPI_Isend(B, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[5]);
     MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[7]);
     MPI_Wait(&request[7], &status[7]);
     }
    
    if ((my_rank) == 1)
    {
    MPI_Irecv(CC, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[1]);
    MPI_Wait(&request[1], &status[1]); 
    MPI_Isend(Cmpi, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[3]);
    
    MPI_Isend(D, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[6]); 
    MPI_Irecv(DD, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[8]);
    MPI_Wait(&request[8], &status[8]);
    }
    
    if ((my_rank) == 2)
    {
    MPI_Isend(E, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[2]);
    MPI_Irecv(EE, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[4]);
    MPI_Wait(&request[4], &status[4]);
    
    MPI_Irecv(FF, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[5]);
    MPI_Wait(&request[5], &status[5]);
    MPI_Isend(Fmpi, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[7]);
    }
    
    if ((my_rank) == 3)
    {
    MPI_Irecv(GG, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[2]);
    MPI_Wait(&request[2], &status[2]);
    MPI_Isend(G, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[4]);
    
    MPI_Irecv(HH, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
    MPI_Wait(&request[6], &status[6]); 
    MPI_Isend(H, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[8]);
    }
    
     //RELOADING Data (from MPI_IRecv array to array used to compute at next time step)
     for (Columns = 0; Columns < 48; Columns++)
     {
     for (Rows = 0; Rows < 48; Rows++)
     {
     if (my_rank == 0)
     {
     if (Column == 47)
     {
     V[Column][Row]= A[i];
     }
     if (Row == 47)
     {
     V[Column][Row]=B[i];
     }
      }
    
      ….
     //PRINT TO OUTPUT FILE AT CERTAIN POINT
     printval = 100; 
     if ((printdata>=printval))
     {
     prttofile ();
     printdata = 0;
     }
     printdata = printdata+1;
     compute_dt (); 
    
     }//CLOSE ALL TIME STEPS
    
     MPI_Finalize ();
    
      }//CLOSE MAIN
    

2 Answers:

Answer 0: (score: 4)

Are you calling MPI_Irecv many times? If so, you may not realize that each call allocates a request handle; these are freed only when the message has been received and its completion has been tested with (for example) MPI_Test. It is possible that you are exhausting memory through over-use of MPI_Irecv, or rather the memory that the MPI implementation sets aside for this purpose.

Only a look at the code could confirm the problem.
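
To illustrate the point, here is a minimal, hypothetical sketch (only the 48-double message size is taken from the error log; the ranks and buffer are assumptions) of how a nonblocking receive is completed so that its request handle is released:

    #include <mpi.h>

    // Sketch only: the request handle created by MPI_Irecv stays allocated inside
    // the MPI library until the receive is completed, e.g. by polling with MPI_Test.
    // Posting many Irecvs without ever completing them leaves all of those handles,
    // and the library memory behind them, in use.
    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[48] = {0.0};   // 48 doubles, as in the error log

        if (rank == 0) {
            MPI_Send(buf, 48, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Request req;
            MPI_Status  status;
            MPI_Irecv(buf, 48, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &req);

            int flag = 0;
            while (!flag) {
                MPI_Test(&req, &flag, &status);   // handle is freed once flag == 1
                // ... other work could overlap the communication here ...
            }
        }

        MPI_Finalize();
        return 0;
    }

MPI_Wait would have the same effect in a single call; the important point is that every request eventually reaches one of these completion calls.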

Answer 1: (score: 0)

Now that the code has been added to the question: this is indeed leaky code. You only wait for the requests from your Irecv calls. Yes, once the message has been received you know the send has completed, so in that sense you do not have to wait for it. But skipping the wait gives you a memory leak: the Isend allocates a new request, and the Wait is what releases it. Since you never wait, you never release, and you have a memory leak.
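
One way to plug that leak, sketched below for the same 48-double exchange as in the question (the function name and buffer parameters are hypothetical), is to keep the send request next to the receive request and complete both, for example with MPI_Waitall:

    #include <mpi.h>

    // Sketch only: complete BOTH the Isend and the Irecv request each time they are
    // posted. Waiting only on the receive request leaks one send request per call,
    // i.e. one per time step.
    void exchange_with(int partner, double *sendbuf, double *recvbuf)
    {
        MPI_Request req[2];

        MPI_Isend(sendbuf, 48, MPI_DOUBLE, partner, 0,           MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recvbuf, 48, MPI_DOUBLE, partner, MPI_ANY_TAG, MPI_COMM_WORLD, &req[1]);

        // Completing both requests releases both handles inside the MPI library.
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

Each rank could call something like this once per neighbour per time step. Alternatively, all of the requests posted in a step can be collected into one array and completed with a single MPI_Waitall at the end of the step, which also removes the serialization introduced by waiting immediately after each receive.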