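I have written a solver with MPI, but it crashes or hangs when the number of tasks on the cluster is greater than 80. The failure point moves when I change the test code or the number of tasks: sometimes the solver crashes, and sometimes it hangs. At first I thought a memory leak somewhere before the failure point was responsible. But after some testing I found that it crashes even when I only do a simple data transfer at the very beginning of the solver. In that case it only crashes and never hangs.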
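My questions are:
Is there any bug in the subroutine TestMPI() that causes the crash?
If the answer to the first question is no, what are the possible causes of this crash?
Thanks!
The solver code is attached below. The data transfer function is: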
void TestMPI()
{
    int check = 1;
    MPI_Bcast(&check, 1, MPI_INT, 0, MPI_COMM_WORLD);
    int N_region = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &N_region);
    int mpi_Rank=0;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_Rank);
    MPI_Request * reqSend_test = new MPI_Request [N_region];
    MPI_Request * reqRecv_test = new MPI_Request [N_region];
    MPI_Status * status_test = new MPI_Status [N_region];
    MPI_Status statRecv;
    int cnt_test(0);
    double * Read_test = new double[N_region];
    double * Send_test = new double[N_region];
    for (int ii=0; ii<N_region; ii++)
    {
        if (ii == mpi_Rank)
            continue;
        int tag = (ii) * N_region + mpi_Rank;
        MPI_Irecv(&Read_test[ii],1,MPI_DOUBLE,ii,tag,MPI_COMM_WORLD,&reqRecv_test[ii]);
        cnt_test++;
    }
    cnt_test = 0;
    for (int ii=0; ii<N_region; ii++)
    {
        if (ii == mpi_Rank)
            continue;
        Send_test[ii] = mpi_Rank;
        int tag = (mpi_Rank)*N_region + ii;
        MPI_Isend(&Send_test[ii],1,MPI_DOUBLE,ii,tag,MPI_COMM_WORLD,&reqSend_test[ii]);
        cnt_test++;
    }
    //MPI_Waitall(N_region-1, reqSend_test, status_test);
    char fname [80];
    sprintf(fname, "TestMPI_result%d", mpi_Rank);
    FILE * stream = fopen(fname,"w");
    fprintf(stream, "After MPI_Waitall send\n");
    fflush(stream);
    for (int ii=0; ii<N_region; ii++)
    {
        if (ii == mpi_Rank)
            continue;
        MPI_Wait(&reqSend_test[ii], &statRecv);
        fprintf(stream, "After Wait Send %d\n", ii);
        fflush(stream);
    }
    for (int ii=0; ii<N_region; ii++)
    {
        if (ii == mpi_Rank)
            continue;
        MPI_Wait(&reqRecv_test[ii], &statRecv);
        fprintf(stream, "After Wait Recv %d\n", ii);
        fflush(stream);
    }
    fprintf(stream, "After Start Test\n");
    fflush(stream);
    //MPI_Waitall(N_region-1, reqRecv_test, status_test);
    MPI_Bcast(&check, 1, MPI_INT, 0, MPI_COMM_WORLD);
    fprintf(stream, "After MPI_Bcast\n");
    fflush(stream);
    fclose(stream);
}
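The main function that calls this data transfer function is shown below. (The code before the call to TestMPI() creates the simulation folders for the slave tasks.)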
int main(int argc, char* argv[])
{
    int ISV_LIC = 17143112; // For Platform Computing mpi initialization
    MPI_Initialized(&ISV_LIC);
    int threadingUsed = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &threadingUsed);
    int mpi_Rank=0;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_Rank);
    if (mpi_Rank == 0)
    {
        std::cerr << "solverok" << std::endl;
        std::cerr.flush();
        AnsDebug(ACHAR("3dtds-main"), 1, ACHAR("After solverok\n"));
    }
    AString simulationDir_master;
    AString simulationDir;
    AString tempDir;
    simulationDir_master=argv[argc-3];
    ANSOFT_CHDIR(simulationDir_master.ANS_ANSI().Str()); //simulationDir_master;
    mpi_BcastEnvVariables();
    bool no_temp_dir_override;
    no_temp_dir_override = false;
    for (int ii=0; ii<argc; ii++)
    {
        if (strcmp(argv[ii], "no_temp_dir_override") == 0)
        {
            no_temp_dir_override = true;
            break;
        }
    }
    AString Dbname;
    Dbname=argv[5];
    AString versionedProductName;
    versionedProductName = argv[argc-4];
    AString InstallDir;
    char InstallDir_c[MAX_PATH];
    ANSOFT_GETCWD(InstallDir_c, MAX_PATH);
    InstallDir=AString(InstallDir_c);
    if (mpi_Rank != 0)
    {
        if (!no_temp_dir_override)
        {
            tempDir=argv[argc-1];
        }else{
            RegistryAccessNgMaxwell reg;
            tempDir = reg.GetTempDirectory_Reg(versionedProductName,InstallDir);
            tempDir = tempDir.Left(tempDir.size()-1);
        }
#ifndef NDEBUG
        tempDir = "E:/temp";
#endif
        CreateSimulationDir(mpi_Rank, tempDir, Dbname, ACHAR("maxwell"), simulationDir);
        ANSOFT_CHDIR(simulationDir.ANS_ANSI().Str());
    }
    char fname [80];
    sprintf(fname, "RecordMPI%d",mpi_Rank);
    FILE * stream = fopen(fname,"w");
    fprintf(stream, "Before testMPI\n");
    fflush(stream);
    TestMPI();
    fprintf(stream, "After testMPI\n");
    fflush(stream);
    fclose(stream);
    MPI_Finalize();
}
The output file for task 79 is shown below; it shows the point of the crash.
After MPI_Waitall send
After Wait Send 0
After Wait Send 1
After Wait Send 2
After Wait Send 3
After Wait Send 4
After Wait Send 5
After Wait Send 6
After Wait Send 7
After Wait Send 8
After Wait Send 9
After Wait Send 10
After Wait Send 11
After Wait Send 12
After Wait Send 13
After Wait Send 14
After Wait Send 15
After Wait Send 16
After Wait Send 17
After Wait Send 18
After Wait Send 19
After Wait Send 20
After Wait Send 21
After Wait Send 22
After Wait Send 23
After Wait Send 24
After Wait Send 25
After Wait Send 26
After Wait Send 27
After Wait Send 28
After Wait Send 29
After Wait Send 30
After Wait Send 31
After Wait Send 32
After Wait Send 33
After Wait Send 34
After Wait Send 35
After Wait Send 36
After Wait Send 37
After Wait Send 38
After Wait Send 39
After Wait Send 40
After Wait Send 41
After Wait Send 42
After Wait Send 43
After Wait Send 44
After Wait Send 45
After Wait Send 46
After Wait Send 47
After Wait Send 48
After Wait Send 49
After Wait Send 50
After Wait Send 51
After Wait Send 52
After Wait Send 53
After Wait Send 54
After Wait Send 55
After Wait Send 56
After Wait Send 57
After Wait Send 58
After Wait Send 59
After Wait Send 60
After Wait Send 61
After Wait Send 62
After Wait Send 63
After Wait Send 64
After Wait Send 65
After Wait Send 66
After Wait Send 67
After Wait Send 68
After Wait Send 69
After Wait Send 70
After Wait Send 71
After Wait Send 72
After Wait Send 73
After Wait Send 74
After Wait Send 75
After Wait Send 76
After Wait Send 77
After Wait Send 78
After Wait Send 80
After Wait Send 81
After Wait Send 82
After Wait Send 83
After Wait Send 84
After Wait Send 85
After Wait Send 86
After Wait Send 87
After Wait Send 88
After Wait Send 89
After Wait Send 90
After Wait Send 91
After Wait Send 92
After Wait Send 93
After Wait Send 94
After Wait Send 95
After Wait Send 96
After Wait Send 97
After Wait Send 98
After Wait Send 99
After Wait Send 100
After Wait Send 101
After Wait Send 102
After Wait Send 103
After Wait Send 104
After Wait Recv 0
After Wait Recv 1
After Wait Recv 2
After Wait Recv 3
After Wait Recv 4
After Wait Recv 5
After Wait Recv 6
After Wait Recv 7
After Wait Recv 8
After Wait Recv 9
After Wait Recv 10
After Wait Recv 11
After Wait Recv 12
After Wait Recv 13
After Wait Recv 14
After Wait Recv 15
After Wait Recv 16
After Wait Recv 17
After Wait Recv 18
After Wait Recv 19
After Wait Recv 20
After Wait Recv 21
After Wait Recv 22
After Wait Recv 23
After Wait Recv 24
After Wait Recv 25
After Wait Recv 26
After Wait Recv 27
After Wait Recv 28
After Wait Recv 29
After Wait Recv 30
After Wait Recv 31