MPI_Comm_spawn error: MPI Application rank 0 killed before MPI_Finalize() with signal 11

Time: 2013-02-08 03:20:17

Tags: c mpi

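I am trying to run an MPI program using MPI_Comm_spawn. I spawn 1 worker and then call MPI_Reduce in both programs to add up some results. For some reason the application hangs at MPI_Comm_spawn and then aborts about a minute later. The spawned process only gets as far as the section of its code just before the MPI_Reduce call. The application then keeps hanging and eventually prints more errors at the command prompt. What should happen is that both the spawned program and the main program reach the MPI_Reduce call, the main program receives the sum, and it prints that sum.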

Here is the output. Lines prefixed with <> are MPI's own output rather than mine:

world size = 1   
About to call MPI_Comm_spawn with 2 workers...   
parent result is 3.141668952    
numDarts for child: 500000000  
argv[1] = 500000000  
<>MPI Application rank 0 killed before MPI_Finalize() with signal 11  
spawned process got result: 3.141668952  
Spawned process about to send message back to parent  
<>piworker: Rank 1:0: MPI_Finalize: IBV connection to 0 on card 0 is broken
<>piworker: Rank 1:0: MPI_Finalize: ibv_poll_cq(): bad status 12  
<>piworker: Rank 1:0: MPI_Finalize: self n93 peer n93 (rank: 0)  
<>piworker: Rank 1:0: MPI_Finalize: error message: transport retry exceeded error

Here is the code for the main program:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "globals.h"


int randSign();
double randFloat();
double dboard();


int main(int argc, char *argv[])
{
    int world_size, flag;
    MPI_Comm everyone;           /* intercommunicator */
    char worker_program[100];
    int universe_size;

    //  MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe_size, &flag);
    //   printf("universe size: %i\n", universe_size);

    int numDarts = 1000000000;
    int numWorkers = 2;

    char* args[1];
    if(argc >= 2)
    {
      numWorkers = atoi(argv[1]);
    }
      if(argc >= 3)
    numDarts = atoi(argv[2]);

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &world_size);

   printf("world size = %i\n", world_size);
   if (world_size != 1)
        printf("Top heavy with management\n");

   int numDartsWorker = numDarts/numWorkers;
   int numDartsMaster = numDarts/numWorkers + (numDarts % numWorkers); //the master computes the leftover
   args[0] = malloc(256 * sizeof(char));
   sprintf(args[0], "%i", numDartsWorker);
   printf("argument passing to workers: %s\n", args[0]);
   /*
    * Now spawn the workers. Note that there is a run-time determination
    * of what type of worker to spawn, and presumably this calculation must
    * be done at run time and cannot be calculated before starting
    * the program. If everything is known when the application is
    * first started, it is generally better to start them all at once
    * in a single MPI_COMM_WORLD.
    */
   printf("About to call MPI_Comm_spawn with %i workers...\n", numWorkers);
   int resultLen = 0;

   double myresult = dboard(numDartsMaster);
   printf("parent result is %.9f\n", myresult);


   //the master counts as a worker, hence the -1
   MPI_Comm_spawn("piworker", args, numWorkers-1, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                   &everyone, MPI_ERRCODES_IGNORE);

   double pisum = 24;
   int rc = MPI_Reduce(&myresult, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, everyone);

   if (rc != MPI_SUCCESS)
        printf("failure on mpi_reduce\n");

   free(args);
   /*
    * Parallel code here. The communicator "everyone" can be used
    * to communicate with the spawned processes, which have ranks 0,..
    * MPI_UNIVERSE_SIZE-1 in the remote group of the intercommunicator
    * "everyone".
    */

   //receive the results
   int i=1;
   MPI_Status status;
   double avgpi = pisum/(double)numWorkers;
   printf("With %i workers, %i darts, estimated value of pi is: %.9f\n", numWorkers, numDarts, avgpi);

   MPI_Finalize();
   return 0;
}

Code for the worker (spawned) program:

int main(int argc, char *argv[])
{
   int size;
   MPI_Comm parent;
   MPI_Init(&argc, &argv);
   MPI_Comm_get_parent(&parent);
   if (parent == MPI_COMM_NULL)
        printf("No parent!");
   int taskid;
   MPI_Comm_remote_size(parent, &size);
   MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
   double pisum = 0;
   int resultLen = 0;
   char parentName[256];
   int numDarts;


   if (size != 1)
   {
        printf("Something's wrong with the parent");
        return 1;
   }
   /*
    * Parallel code here.
    * The manager is represented as the process with rank 0 in (the remote
    * group of) the parent communicator.  If the workers need to communicate
    * among themselves, they can use MPI_COMM_WORLD.
    */
   if(argc >= 2)
        numDarts = atoi(argv[1]);
   else
   {
      printf("Error for: %i, number of darts not specified.\n", taskid);
   }
   printf("numDarts for child: %i\n", numDarts);
   printf("argv[1] = %s\n", argv[1]);
   double myPiSum = dboard(numDarts);
   printf("spawned process got result: %.9f\n", myPiSum);
   printf("Spawned process about to send message back to parent\n");
 //MPI_Send((void *)&myPiSum, 1, MPI_DOUBLE, 0, 1, parent);

   int rc = MPI_Reduce(&myPiSum, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, parent);
   if(rc != MPI_SUCCESS)
        printf("%d: Problem with mpi_reduce\n");


   printf("Sent message back to parent");
   MPI_Finalize();
   return 0;
}

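Hopefully the cause will be more obvious to someone with experience in this area. I have been trying all sorts of things, which is why there are so many printf calls.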

1 answer:

Answer 0: (score: 2)

The master process is terminated because of an incorrect use of free():

char* args[1];
...
args[0] = malloc(256 * sizeof(char));
...
free(args);

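You are trying to free non-heap (stack) memory: args is an array on the stack, and only the string stored in args[0] came from malloc. Calling free(args) triggers an abort in modern glibc versions. The correct call is: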

free(args[0]);
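
The ownership rule behind this fix can be sketched as follows. This is a minimal illustration, not the asker's code: make_darts_arg is a hypothetical helper introduced here. The commented usage also adds a NULL terminator to the argument array, since the C binding of MPI_Comm_spawn expects argv to be NULL-terminated; the original char* args[1] has no room for one, which looks like a separate issue worth checking.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not in the original code): returns a heap-allocated
 * decimal string for the dart count, or NULL if malloc fails.
 * The caller owns the string and releases it with free(). */
char *make_darts_arg(int numDartsWorker)
{
    size_t len = 32;               /* ample room for any int plus '\0' */
    char *arg = malloc(len);
    if (arg != NULL)
        snprintf(arg, len, "%i", numDartsWorker);
    return arg;
}

/* Intended use in the master, shown as a comment because it needs MPI:
 *
 *   char *args[2];                      // the array itself lives on the stack
 *   args[0] = make_darts_arg(numDartsWorker);
 *   args[1] = NULL;                     // argv for MPI_Comm_spawn is NULL-terminated
 *   MPI_Comm_spawn("piworker", args, numWorkers - 1, MPI_INFO_NULL, 0,
 *                  MPI_COMM_SELF, &everyone, MPI_ERRCODES_IGNORE);
 *   free(args[0]);                      // free the malloc'd string, never args
 */
```

Only pointers returned by malloc (here args[0]) are passed to free; the array args is released automatically when main returns.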

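Besides that, MPI_Reduce does not work the way you expect when it is called on an intercommunicator. You have to change the master code so that it passes MPI_ROOT as the root argument to MPI_Reduce, and then you have to add the master's value manually, because it does not take part in the reduction (only the values from the processes in the remote group are reduced; see here).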
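
A rough sketch of that change, under stated assumptions: average_pi and workersum are hypothetical names introduced here, and the MPI calls appear only in comments so that the arithmetic compiles without an MPI installation. The divisor is the total number of participants including the master, which matches how the question uses numWorkers.

```c
/* Hypothetical helper: combine the parent's own estimate with the sum
 * received from the spawned workers and average over all participants. */
double average_pi(double parent_result, double worker_sum, int num_workers)
{
    return (parent_result + worker_sum) / (double)num_workers;
}

/* In the master (root side of the intercommunicator "everyone"):
 *
 *   double workersum = 0.0;
 *   // The send buffer is not significant at the root of an
 *   // intercommunicator reduction; only workersum is filled in.
 *   MPI_Reduce(&myresult, &workersum, 1, MPI_DOUBLE, MPI_SUM,
 *              MPI_ROOT, everyone);
 *   double avgpi = average_pi(myresult, workersum, numWorkers);
 *
 * In the worker, the existing call already names the parent's rank (0)
 * in the remote group as root, so it can stay as it is:
 *
 *   MPI_Reduce(&myPiSum, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, parent);
 */
```

If the master's communicator ever contained more than one process, the non-root processes on the master side would pass MPI_PROC_NULL as root; with MPI_COMM_SELF as the spawning communicator that case does not arise.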