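I am trying to run an MPI program that uses MPI_Comm_spawn. I spawn one worker and then call MPI_Reduce in both programs to add up some results. For some reason the application hangs at MPI_Comm_spawn and then aborts about a minute later. The spawned process runs its code only up to the call to MPI_Reduce. After that the application keeps hanging and then prints more errors in the command prompt. What should happen is that both the spawned program and the master program reach the MPI_Reduce call, the master receives the sum, and the master prints that sum.
Here is the output. I put <> in front of the lines that come from MPI rather than from my own code: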
world size = 1
About to call MPI_Comm_spawn with 2 workers...
parent result is 3.141668952
numDarts for child: 500000000
argv[1] = 500000000
<>MPI Application rank 0 killed before MPI_Finalize() with signal 11
spawned process got result: 3.141668952
Spawned process about to send message back to parent
<>piworker: Rank 1:0: MPI_Finalize: IBV connection to 0 on card 0 is broken
<>piworker: Rank 1:0: MPI_Finalize: ibv_poll_cq(): bad status 12
<>piworker: Rank 1:0: MPI_Finalize: self n93 peer n93 (rank: 0)
<>piworker: Rank 1:0: MPI_Finalize: error message: transport retry exceeded error
Here is the code for the master program:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "globals.h"

int randSign();
double randFloat();
double dboard();

int main(int argc, char *argv[])
{
    int world_size, flag;
    MPI_Comm everyone; /* intercommunicator */
    char worker_program[100];
    int universe_size;
    // MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe_size, &flag);
    // printf("universe size: %i\n", universe_size);
    int numDarts = 1000000000;
    int numWorkers = 2;
    char* args[1];

    if(argc >= 2)
    {
        numWorkers = atoi(argv[1]);
    }
    if(argc >= 3)
        numDarts = atoi(argv[2]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    printf("world size = %i\n", world_size);
    if (world_size != 1)
        printf("Top heavy with management\n");

    int numDartsWorker = numDarts/numWorkers;
    int numDartsMaster = numDarts/numWorkers + (numDarts % numWorkers); //the master computes the leftover
    args[0] = malloc(256 * sizeof(char));
    sprintf(args[0], "%i", numDartsWorker);
    printf("argument passing to workers: %s\n", args[0]);

    /*
     * Now spawn the workers. Note that there is a run-time determination
     * of what type of worker to spawn, and presumably this calculation must
     * be done at run time and cannot be calculated before starting
     * the program. If everything is known when the application is
     * first started, it is generally better to start them all at once
     * in a single MPI_COMM_WORLD.
     */
    printf("About to call MPI_Comm_spawn with %i workers...\n", numWorkers);
    int resultLen = 0;
    double myresult = dboard(numDartsMaster);
    printf("parent result is %.9f\n", myresult);

    //the master counts as a worker, hence the -1
    MPI_Comm_spawn("piworker", args, numWorkers-1, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                   &everyone, MPI_ERRCODES_IGNORE);

    double pisum = 24;
    int rc = MPI_Reduce(&myresult, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, everyone);
    if (rc != MPI_SUCCESS)
        printf("failure on mpi_reduce\n");
    free(args);

    /*
     * Parallel code here. The communicator "everyone" can be used
     * to communicate with the spawned processes, which have ranks 0,..
     * MPI_UNIVERSE_SIZE-1 in the remote group of the intercommunicator
     * "everyone".
     */
    //receive the results
    int i=1;
    MPI_Status status;
    double avgpi = pisum/(double)numWorkers;
    printf("With %i workers, %i darts, estimated value of pi is: %.9f\n", numWorkers, numDarts, avgpi);

    MPI_Finalize();
    return 0;
}
Here is the code for the worker (spawned) program:
int main(int argc, char *argv[])
{
    int size;
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL)
        printf("No parent!");

    int taskid;
    MPI_Comm_remote_size(parent, &size);
    MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
    double pisum = 0;
    int resultLen = 0;
    char parentName[256];
    int numDarts;

    if (size != 1)
    {
        printf("Something's wrong with the parent");
        return 1;
    }

    /*
     * Parallel code here.
     * The manager is represented as the process with rank 0 in (the remote
     * group of) the parent communicator. If the workers need to communicate
     * among themselves, they can use MPI_COMM_WORLD.
     */
    if(argc >= 2)
        numDarts = atoi(argv[1]);
    else
    {
        printf("Error for: %i, number of darts not specified.\n", taskid);
    }
    printf("numDarts for child: %i\n", numDarts);
    printf("argv[1] = %s\n", argv[1]);

    double myPiSum = dboard(numDarts);
    printf("spawned process got result: %.9f\n", myPiSum);
    printf("Spawned process about to send message back to parent\n");

    //MPI_Send((void *)&myPiSum, 1, MPI_DOUBLE, 0, 1, parent);
    int rc = MPI_Reduce(&myPiSum, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, parent);
    if(rc != MPI_SUCCESS)
        printf("%d: Problem with mpi_reduce\n");
    printf("Sent message back to parent");

    MPI_Finalize();
    return 0;
}
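Hopefully the cause will be more obvious to someone with experience in this area. I have been trying all sorts of things, which is why there are so many printf calls.
Answer 0 (score: 2)
The problem is caused by the call to free():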
char* args[1];
...
args[0] = malloc(256 * sizeof(char));
...
free(args);
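You are trying to free non-heap (stack) memory: args is an array declared on the stack, and only args[0] was obtained from malloc(). Calling free(args) triggers an abort in modern glibc versions. The correct call is: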
free(args[0]);
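As a minimal sketch of the ownership rule above, the following plain C fragment keeps the pointer array on the stack and frees only the heap string stored in it. The helper names fill_worker_args and release_worker_args are hypothetical and do not appear in the question. The sketch also adds a NULL terminator, which is an addition on my part: the argv array passed to MPI_Comm_spawn in C is expected to be NULL-terminated, so it uses two slots instead of one.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: stores the dart count as a heap-allocated string in
 * args[0] and terminates the argument list with NULL in args[1].
 * The array itself is supplied by the caller (for example, on the stack).
 * Returns 0 on success, -1 if the allocation failed. */
int fill_worker_args(char *args[2], int num_darts)
{
    args[1] = NULL;              /* terminator for the argument list */
    args[0] = malloc(32);        /* enough room for any int printed in decimal */
    if (args[0] == NULL)
        return -1;
    snprintf(args[0], 32, "%d", num_darts);
    return 0;
}

/* Hypothetical helper: releases only what malloc() returned.
 * The array is not heap memory, so it must never be passed to free(). */
void release_worker_args(char *args[2])
{
    free(args[0]);
    args[0] = NULL;
}
```

In the master program this would correspond to declaring `char *args[2];`, filling it before MPI_Comm_spawn, and calling the release helper after the spawn instead of free(args).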
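Apart from that, MPI_Reduce does not work the way you expect when it is called on an intercommunicator. You have to change the master code so that it passes MPI_ROOT as the root argument to MPI_Reduce, and then you have to add the master's own value manually, because it does not take part in the reduction (only the values from the processes in the remote group are reduced; see here).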