MPI crashes with non-blocking send and recv

Time: 2017-03-08 18:58:21

Tags: mpi

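I have written a solver with MPI, but it crashes or hangs when the number of tasks on the cluster is greater than 80. Sometimes it crashes and sometimes it hangs, depending on changes to the test code and to the number of tasks. At first I thought a memory leak somewhere before the failure point might be the cause. But after some tests I found that it crashes even if I only do a simple data transfer at the very beginning of the solver. In this case it only crashes and never hangs.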

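My questions are:

  1. Is there any error in the subroutine TestMPI() that causes the crash?

  2. If the answer to the first question is no, what are the possible causes of this crash?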

Thanks!

I attach the solver below. The function that performs the data transfer is:

    void TestMPI()
    {
        int check = 1;
        MPI_Bcast(&check, 1, MPI_INT, 0, MPI_COMM_WORLD);  
    
        int N_region = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &N_region);
        int mpi_Rank=0;
        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_Rank);
    
        MPI_Request * reqSend_test = new MPI_Request [N_region];
        MPI_Request * reqRecv_test = new MPI_Request [N_region];
        MPI_Status * status_test = new MPI_Status [N_region];
    
        MPI_Status statRecv;
    
        int cnt_test(0);
        double * Read_test = new double[N_region];
        double * Send_test = new double[N_region];
        for (int ii=0; ii<N_region; ii++)
        {
            if (ii == mpi_Rank)
                continue;
            int tag = (ii) * N_region + mpi_Rank;
            MPI_Irecv(&Read_test[ii],1,MPI_DOUBLE,ii,tag,MPI_COMM_WORLD,&reqRecv_test[ii]);
            cnt_test++;
        }
    
        cnt_test = 0;
        for (int ii=0; ii<N_region; ii++)
        {
            if (ii == mpi_Rank)
                continue;
    
            Send_test[ii] = mpi_Rank;
            int tag = (mpi_Rank)*N_region + ii;
            MPI_Isend(&Send_test[ii],1,MPI_DOUBLE,ii,tag,MPI_COMM_WORLD,&reqSend_test[ii]);
            cnt_test++;
        }
    
        //MPI_Waitall(N_region-1, reqSend_test, status_test);
    
        char fname [80];
        sprintf(fname, "TestMPI_result%d", mpi_Rank);
        FILE * stream = fopen(fname,"w");
    
        fprintf(stream, "After MPI_Waitall send\n");
        fflush(stream);
    
        for (int ii=0; ii<N_region; ii++)
        {
            if (ii == mpi_Rank)
                continue;
            MPI_Wait(&reqSend_test[ii], &statRecv);
            fprintf(stream, "After Wait Send %d\n", ii);
            fflush(stream);
        }
    
        for (int ii=0; ii<N_region; ii++)
        {
            if (ii == mpi_Rank)
                continue;
            MPI_Wait(&reqRecv_test[ii], &statRecv);
            fprintf(stream, "After Wait Recv %d\n", ii);
            fflush(stream);
        }
    
        fprintf(stream, "After Start Test\n");
        fflush(stream);
    
        //MPI_Waitall(N_region-1, reqRecv_test, status_test);
    
        MPI_Bcast(&check, 1, MPI_INT, 0, MPI_COMM_WORLD);  
    
        fprintf(stream, "After MPI_Bcast\n");
        fflush(stream);
    
        fclose(stream);
    }
    

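The main function that calls this data-transfer function is shown below (the code before the call to TestMPI() creates the simulation folders for the non-master tasks).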

        int main(int argc, char* argv[])
        {
    
            int ISV_LIC = 17143112; // For Platform Computing mpi initialization
            MPI_Initialized(&ISV_LIC);
    
            int threadingUsed = 0;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &threadingUsed);
    
            int mpi_Rank=0;
            MPI_Comm_rank(MPI_COMM_WORLD, &mpi_Rank);
    
            if (mpi_Rank == 0)
            {
                std::cerr << "solverok" << std::endl;
                std::cerr.flush();
                AnsDebug(ACHAR("3dtds-main"), 1, ACHAR("After solverok\n"));
            }
    
            AString simulationDir_master;
            AString simulationDir;
            AString tempDir;
    
            simulationDir_master=argv[argc-3]; 
            ANSOFT_CHDIR(simulationDir_master.ANS_ANSI().Str()); //simulationDir_master;
    
            mpi_BcastEnvVariables();
    
            bool no_temp_dir_override;
            no_temp_dir_override = false;
            for (int ii=0; ii<argc; ii++)
            {
                if (strcmp(argv[ii], "no_temp_dir_override") == 0)
                {
                    no_temp_dir_override = true;
                    break;
                }
            }
    
            AString Dbname;
            Dbname=argv[5];
    
            AString versionedProductName;
            versionedProductName = argv[argc-4];  
    
            AString InstallDir;
            char InstallDir_c[MAX_PATH];
            ANSOFT_GETCWD(InstallDir_c, MAX_PATH);
            InstallDir=AString(InstallDir_c);
    
            if (mpi_Rank != 0)
            {           
                if (!no_temp_dir_override)
                {
                    tempDir=argv[argc-1];
                }else{
                    RegistryAccessNgMaxwell reg;
                    tempDir = reg.GetTempDirectory_Reg(versionedProductName,InstallDir);
                    tempDir =  tempDir.Left(tempDir.size()-1);
                }
    
        #ifndef NDEBUG
                    tempDir = "E:/temp";
        #endif  
                CreateSimulationDir(mpi_Rank, tempDir, Dbname, ACHAR("maxwell"), simulationDir);
                ANSOFT_CHDIR(simulationDir.ANS_ANSI().Str());
            }
    
            char fname [80];
            sprintf(fname, "RecordMPI%d",mpi_Rank);
            FILE * stream = fopen(fname,"w");
    
            fprintf(stream, "Before testMPI\n");
            fflush(stream);
    
            TestMPI();
    
            fprintf(stream, "After testMPI\n");
            fflush(stream);
    
            fclose(stream);
    
            MPI_Finalize();
        }
    

The output file of task 79 is shown below; it shows the point of the crash.

    After MPI_Waitall send
    After Wait Send 0
    After Wait Send 1
    After Wait Send 2
    After Wait Send 3
    After Wait Send 4
    After Wait Send 5
    After Wait Send 6
    After Wait Send 7
    After Wait Send 8
    After Wait Send 9
    After Wait Send 10
    After Wait Send 11
    After Wait Send 12
    After Wait Send 13
    After Wait Send 14
    After Wait Send 15
    After Wait Send 16
    After Wait Send 17
    After Wait Send 18
    After Wait Send 19
    After Wait Send 20
    After Wait Send 21
    After Wait Send 22
    After Wait Send 23
    After Wait Send 24
    After Wait Send 25
    After Wait Send 26
    After Wait Send 27
    After Wait Send 28
    After Wait Send 29
    After Wait Send 30
    After Wait Send 31
    After Wait Send 32
    After Wait Send 33
    After Wait Send 34
    After Wait Send 35
    After Wait Send 36
    After Wait Send 37
    After Wait Send 38
    After Wait Send 39
    After Wait Send 40
    After Wait Send 41
    After Wait Send 42
    After Wait Send 43
    After Wait Send 44
    After Wait Send 45
    After Wait Send 46
    After Wait Send 47
    After Wait Send 48
    After Wait Send 49
    After Wait Send 50
    After Wait Send 51
    After Wait Send 52
    After Wait Send 53
    After Wait Send 54
    After Wait Send 55
    After Wait Send 56
    After Wait Send 57
    After Wait Send 58
    After Wait Send 59
    After Wait Send 60
    After Wait Send 61
    After Wait Send 62
    After Wait Send 63
    After Wait Send 64
    After Wait Send 65
    After Wait Send 66
    After Wait Send 67
    After Wait Send 68
    After Wait Send 69
    After Wait Send 70
    After Wait Send 71
    After Wait Send 72
    After Wait Send 73
    After Wait Send 74
    After Wait Send 75
    After Wait Send 76
    After Wait Send 77
    After Wait Send 78
    After Wait Send 80
    After Wait Send 81
    After Wait Send 82
    After Wait Send 83
    After Wait Send 84
    After Wait Send 85
    After Wait Send 86
    After Wait Send 87
    After Wait Send 88
    After Wait Send 89
    After Wait Send 90
    After Wait Send 91
    After Wait Send 92
    After Wait Send 93
    After Wait Send 94
    After Wait Send 95
    After Wait Send 96
    After Wait Send 97
    After Wait Send 98
    After Wait Send 99
    After Wait Send 100
    After Wait Send 101
    After Wait Send 102
    After Wait Send 103
    After Wait Send 104
    After Wait Recv 0
    After Wait Recv 1
    After Wait Recv 2
    After Wait Recv 3
    After Wait Recv 4
    After Wait Recv 5
    After Wait Recv 6
    After Wait Recv 7
    After Wait Recv 8
    After Wait Recv 9
    After Wait Recv 10
    After Wait Recv 11
    After Wait Recv 12
    After Wait Recv 13
    After Wait Recv 14
    After Wait Recv 15
    After Wait Recv 16
    After Wait Recv 17
    After Wait Recv 18
    After Wait Recv 19
    After Wait Recv 20
    After Wait Recv 21
    After Wait Recv 22
    After Wait Recv 23
    After Wait Recv 24
    After Wait Recv 25
    After Wait Recv 26
    After Wait Recv 27
    After Wait Recv 28
    After Wait Recv 29
    After Wait Recv 30
    After Wait Recv 31
    

0 answers:

No answers