Question

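I have been writing some code using the PETSc library, and now I want to change parts of it to run in parallel. Most of what I want to parallelize is the matrix initialization and the parts where I generate and compute a large number of values. My problem is that, for some reason, every part of the code runs as many times as the number of cores I use.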

This is just a simple sample program I use to test PETSc and MPI:

#include <ctime>
#include <iostream>
#include <string>
#include <petscmat.h>

using namespace std;

int main(int argc, char** argv)
{
    time_t rawtime;
    time ( &rawtime );
    string sta = ctime (&rawtime);
    cout << "Solving began..." << endl;

PetscInitialize(&argc, &argv, 0, 0);

  Mat            A;        /* linear system matrix */
  PetscInt       i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
  PetscErrorCode ierr;
  PetscBool      flg = PETSC_FALSE;
  PetscScalar    v;
#if defined(PETSC_USE_LOG)
  PetscLogStage  stage;
#endif

  /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
         Compute the matrix and right-hand-side vector that define
         the linear system, Ax = b.
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
  /* 
     Create parallel matrix, specifying only its global dimensions.
     When using MatCreate(), the matrix format can be specified at
     runtime. Also, the parallel partitioning of the matrix is
     determined by PETSc at runtime.

     Performance tuning note:  For problems of substantial size,
     preallocation of matrix memory is crucial for attaining good 
     performance. See the matrix chapter of the users manual for details.
  */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  /* 
     Currently, all PETSc parallel matrix formats are partitioned by
     contiguous chunks of rows across the processors.  Determine which
     rows of the matrix are locally owned. 
  */
  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);

  /* 
     Set matrix elements for the 2-D, five-point stencil in parallel.
      - Each processor needs to insert only elements that it owns
        locally (but any non-local elements will be sent to the
        appropriate processor during matrix assembly). 
      - Always specify global rows and columns of matrix entries.

     Note: this uses the less common natural ordering that orders first
     all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
     instead of J = I +- m as you might expect. The more standard ordering
     would first do all variables for y = h, then y = 2h etc.

   */
PetscMPIInt    rank;        // processor rank
PetscMPIInt    size;        // size of communicator
MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
MPI_Comm_size(PETSC_COMM_WORLD,&size);

cout << "Rank = " << rank << endl;
cout << "Size = " << size << endl;

cout << "Generating 2D-Array" << endl;

double temp2D[120000][3];
 for (Ii=Istart; Ii<Iend; Ii++) { 
    for(J=0; J<n;J++){
      temp2D[Ii][J] = 1;
    }
  }
  cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;

  v = -1.0;
  for (Ii=Istart; Ii<Iend; Ii++) { 
    for(J=0; J<n;J++){
       ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
   }
  }
  cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;

  /* 
     Assemble matrix, using the 2-step process:
       MatAssemblyBegin(), MatAssemblyEnd()
     Computations can be done while messages are in transition
     by placing code between these two statements.
  */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    PetscFinalize();   /* pairs with PetscInitialize(); also finalizes MPI when PETSc started it */
cout << "No more MPI" << endl;
return 0;

}

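My real program has several different .cpp files. I initialize MPI in the main program and call a function in another .cpp file, where I implemented the same kind of matrix filling. However, every cout the program executes before filling the matrix is printed as many times as I have cores.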

I can run my test program as mpiexec -n 4 test and it runs successfully, but for some reason I have to run my real program as mpiexec -n 4 ./myprog

The output of my test program is as follows:

Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
No more MPI
No more MPI
No more MPI
No more MPI

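Edit after two comments: My goal is to run this on a small cluster with 20 nodes, each with 2 cores. Later it should run on a supercomputer, so MPI is definitely the way I need to go. I am currently testing on two different machines: one has 1 processor / 4 cores, and the other has 4 processors / 16 cores.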

Answer 1

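MPI is an implementation of the SPMD/MPMD model (single program, multiple data / multiple programs, multiple data). An MPI job consists of concurrently running processes that exchange messages with one another in order to cooperate on solving a problem. You cannot run only parts of the code in parallel: every process executes the whole program. You can only have parts of the code that do not communicate with each other but still execute concurrently. You should use mpirun or mpiexec to launch your application in parallel mode.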

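If you only want to make parts of your code parallel and can live with the restriction that the code runs on a single machine only, then what you need is OpenMP rather than MPI. Alternatively, you could use low-level POSIX threads programming, since according to the PETSc website it supports pthreads. OpenMP is typically implemented on top of pthreads, so using PETSc with OpenMP may be possible.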
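For comparison, a shared-memory version of the question's array fill could look like the sketch below. The function name fill_rows is hypothetical. The pragma is standard OpenMP and takes effect only when the compiler is given its OpenMP flag (for example -fopenmp with GCC or Clang); without the flag the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: build an n_rows x n_cols row-major array
// with every entry set to `value`.
std::vector<double> fill_rows(long n_rows, int n_cols, double value)
{
    const std::size_t cols = static_cast<std::size_t>(n_cols);
    std::vector<double> data(static_cast<std::size_t>(n_rows) * cols);

    // One process, several threads: the row iterations are divided among
    // the threads, and all of them write into the same shared vector.
    #pragma omp parallel for
    for (long i = 0; i < n_rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            data[static_cast<std::size_t>(i) * cols + j] = value;
        }
    }
    return data;
}
```

In this model the program is started once, so a cout placed outside the parallel loop prints once; only the loop body is divided among threads.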

Answer 2

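To add to Hristo's answer: MPI is built to run in a distributed fashion, i.e. as completely separate processes. They have to be separate because they are meant to run on different physical machines. You can run multiple MPI processes on one machine, for example one per core. That is perfectly fine, but MPI does not have any facilities to take advantage of that shared-memory environment. In other words, you cannot have some MPI ranks (processes) operate on a matrix owned by another MPI process, because you have no way of sharing the matrix.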
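The ownership ranges printed in the question (0 - 30000, 30000 - 60000, and so on) show what this separation means in practice: each process holds only its own contiguous block of rows. The sketch below computes such a block split without calling PETSc. The names RowRange, block_row_range and make_local_block are hypothetical, and the split rule (rows divided evenly, with the first N % size ranks taking one extra row) is an assumption about what PETSC_DECIDE produces; in real code MatGetOwnershipRange remains the authority.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type: half-open range [start, end) of global row indices.
struct RowRange {
    long start;
    long end;
};

// Split `global_rows` rows into `size` contiguous blocks and return the
// block belonging to `rank`. Leftover rows go to the lowest ranks.
// Assumption: this mirrors the default PETSC_DECIDE layout.
RowRange block_row_range(long global_rows, int size, int rank)
{
    const long base  = global_rows / size;
    const long extra = global_rows % size;
    const long start = rank * base + (rank < extra ? rank : extra);
    const long count = base + (rank < extra ? 1 : 0);
    return RowRange{start, start + count};
}

// Allocate row-major storage for the locally owned rows only.
// A global row Ii then lives at local row (Ii - range.start).
std::vector<double> make_local_block(const RowRange& range, int n_cols,
                                     double fill)
{
    const std::size_t rows = static_cast<std::size_t>(range.end - range.start);
    return std::vector<double>(rows * static_cast<std::size_t>(n_cols), fill);
}
```

With 120000 rows and 4 processes, rank 2 gets rows 60000 to 90000, which matches the question's output. Indexing the local block with Ii - Istart lets each process allocate 30000 x 3 values instead of declaring the full temp2D[120000][3] array in every process.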

When you start x MPI processes, you get x copies of the exact same program running. You need code like

if (rank == 0)
    do something
else
    do something else

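to make different processes do different things. The processes can communicate with each other by sending messages, but they all run the same binary. If your code has no such divergence, you simply get x copies of the same program producing the same result x times.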

C++ and MPI: how to write parts of the code to run in parallel?
