I was wondering if anyone could offer an explanation.
I'll start with the code:
/*
 Barrier implemented using tournament-style coding
*/
// Constraints: Number of processes must be a power of 2, e.g.
// 2,4,8,16,32,64,128,etc.
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

void mybarrier(MPI_Comm);

// global debug bool
int verbose = 1;

int main(int argc, char * argv[]) {
    int rank;
    int size;
    int i;
    int sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int check = size;

    // check to make sure the number of processes is a power of 2
    if (rank == 0){
        while(check > 1){
            if (check % 2 == 0){
                check /= 2;
            } else {
                printf("ERROR: The number of processes must be a power of 2!\n");
                MPI_Abort(MPI_COMM_WORLD, 1);
                return 1;
            }
        }
    }

    // simple task, with barrier in the middle
    for (i = 0; i < 500; i++){
        sum ++;
    }
    mybarrier(MPI_COMM_WORLD);
    for (i = 0; i < 500; i++){
        sum ++;
    }

    if (verbose){
        printf("process %d arrived at finalize\n", rank);
    }
    MPI_Finalize();
    return 0;
}

void mybarrier(MPI_Comm comm){
    // MPI variables
    int rank;
    int size;
    int * data;
    MPI_Status * status;

    // Loop variables
    int i;
    int a;
    int skip;
    int complete = 0;
    int currentCycle = 1;

    // Initialize MPI vars
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // step 1, gathering
    while (!complete){
        skip = currentCycle * 2;
        // if currentCycle divides rank evenly, then it is a target
        if ((rank % currentCycle) == 0){
            // if skip divides rank evenly, then it needs to receive
            if ((rank % skip) == 0){
                MPI_Recv(data, 0, MPI_INT, rank + currentCycle, 99, comm, status);
                if (verbose){
                    printf("1: %d from %d\n", rank, rank + currentCycle);
                }
            // otherwise, it needs to send. Once sent, the process is done
            } else {
                if (verbose){
                    printf("1: %d to %d\n", rank, rank - currentCycle);
                }
                MPI_Send(data, 0, MPI_INT, rank - currentCycle, 99, comm);
                complete = 1;
            }
        }
        currentCycle *= 2;
        // main process will never send, so this code will allow it to complete
        if (currentCycle >= size){
            complete = 1;
        }
    }

    complete = 0;
    currentCycle = size / 2;

    // step 2, scattering
    while (!complete){
        // if currentCycle is 1, then this is the last loop
        if (currentCycle == 1){
            complete = 1;
        }
        skip = currentCycle * 2;
        // if currentCycle divides rank evenly then it is a target
        if ((rank % currentCycle) == 0){
            // if skip divides rank evenly, then it needs to send
            if ((rank % skip) == 0){
                if (verbose){
                    printf("2: %d to %d\n", rank, rank + currentCycle);
                }
                MPI_Send(data, 0, MPI_INT, rank + currentCycle, 99, comm);
            // otherwise, it needs to receive
            } else {
                if (verbose){
                    printf("2: %d waiting for %d\n", rank, rank - currentCycle);
                }
                MPI_Recv(data, 0, MPI_INT, rank - currentCycle, 99, comm, status);
                if (verbose){
                    printf("2: %d from %d\n", rank, rank - currentCycle);
                }
            }
        }
        currentCycle /= 2;
    }
}
The code increments sum to 500, waits for all other processes to reach that point using blocking MPI_Send and MPI_Recv calls, and then increments sum to 1000.
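To make the pairing in the gathering step easier to follow, here is a minimal sketch in plain C (no MPI involved) that reproduces only the rank arithmetic from the first while loop. The helper names gather_role, gather_rounds and print_gather_schedule are hypothetical and are not part of the posted program; the sketch assumes the process count is a power of 2, as the program itself requires.

```c
#include <assert.h>
#include <stdio.h>

/* Role a rank plays in one gathering round (hypothetical helper type). */
enum role { ROLE_IDLE, ROLE_RECV, ROLE_SEND };

/*
 * Mirrors the rank tests in step 1 of mybarrier for the round whose
 * stride is `cycle`:
 *   - ranks divisible by 2*cycle receive from rank + cycle,
 *   - ranks divisible by cycle but not by 2*cycle send to rank - cycle,
 *   - every other rank has already sent in an earlier round and is idle.
 * The partner rank is written to *partner (-1 when idle).
 */
enum role gather_role(int rank, int cycle, int *partner)
{
    assert(cycle > 0);
    if (rank % cycle != 0) {
        *partner = -1;
        return ROLE_IDLE;
    }
    if (rank % (2 * cycle) == 0) {
        *partner = rank + cycle;
        return ROLE_RECV;
    }
    *partner = rank - cycle;
    return ROLE_SEND;
}

/* Number of gathering rounds: log2(size) when size is a power of 2. */
int gather_rounds(int size)
{
    int rounds = 0;
    for (int cycle = 1; cycle < size; cycle *= 2) {
        rounds++;
    }
    return rounds;
}

/* Print every send that happens during the gathering step. */
void print_gather_schedule(int size)
{
    for (int cycle = 1; cycle < size; cycle *= 2) {
        for (int rank = 0; rank < size; rank++) {
            int partner;
            if (gather_role(rank, cycle, &partner) == ROLE_SEND) {
                printf("stride %d: %d -> %d\n", cycle, rank, partner);
            }
        }
    }
}
```

Calling print_gather_schedule(4) lists the sends 1 -> 0 and 3 -> 2 at stride 1, followed by 2 -> 0 at stride 2. The scattering step walks the same pairs in reverse order with the direction of each message flipped.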
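The cluster behaves as expected.
All processes in the main function report 99, which I have tied specifically to the tag used in the second while loop of mybarrier.
My first draft was written with for loops. That version also ran as expected on the cluster, but on my machine execution never completes, even though every process calls MPI_Finalize (and none gets past it).
My machine is running OpenRTE 2.0.2; the cluster is running OpenRTE 1.6.3.
I have observed that my machine consistently behaves unexpectedly, while the cluster runs normally. The same is true of other MPI code I have written. Were there any major changes between 1.6.3 and 2.0.2?
In any case, I am baffled, and I was wondering if anyone could explain why my machine does not seem to run MPI correctly. I hope I have provided enough detail, but if not, I am happy to provide any other information you need.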
Answer 0 (score: 3)
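Your code has a problem that may be the cause of the strange behaviour you are seeing.
You are passing the MPI_Recv routine a status object that has not been allocated. In fact, the pointer is not even initialized, so if it happens not to be NULL, MPI_Recv will end up writing to an arbitrary location in memory, which is undefined behaviour. The correct form is: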
MPI_Status status;
...
MPI_Recv(..., &status);
Or, if you want to use the heap:
MPI_Status *status = malloc(sizeof(MPI_Status));
...
MPI_Recv(..., status);
...
free(status);
Also, since you are not using the status returned by the receive, you should pass MPI_STATUS_IGNORE instead:
MPI_Recv(..., MPI_STATUS_IGNORE);