Question

I have an MPI program that was written by someone else.

The basic structure is like this:

program basis

initialize MPI

do n=1,12

    call mpi_job(n)  

end do 

finalize MPI

contains

subroutine mpi_job(n)  !this is mpi subroutine
.....
end subroutine

end program

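What I want to do now is turn the do loop into a parallel do loop. So if I have a 24-core machine, I can run this program with 12 mpi_job calls running at the same time, each mpi_job using 2 threads. There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores. In short, I want to turn one level of MPI parallelization into two levels of parallelization.

I find myself running into this problem constantly when I work with other people. The question is: what is the easiest and most efficient way to modify the program?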

Answer 1

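"So if I have a 24-core machine, I can run this program with 12 mpi_job calls running at the same time, each mpi_job using 2 threads."

I wouldn't do this. I recommend mapping MPI processes to NUMA nodes and then spawning k threads per process, where k is the number of cores per NUMA node.

"There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores."

That is an entirely different issue. Which aspects of mpi_job don't scale well? Is it memory bound? Does it require too much communication?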

Answer 2

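You should use sub-communicators.

1. Compute job_nr = floor(global_rank / ranks_per_job).
2. Call MPI_COMM_SPLIT with job_nr as the color. This creates a local communicator for each job.
3. Pass the resulting communicator to mpi_job. All communication inside mpi_job should then use that communicator and the rank local to that communicator.

Of course, this all assumes that there are no dependencies between the different calls to mpi_job, or that you handle any such dependencies on the appropriate global/world communicator.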
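The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the original poster's code: the value ranks_per_job = 2, the program name split_jobs, the variable name job_comm, and the two-argument form of mpi_job are all hypothetical, and the body of mpi_job merely prints its local rank where the real work would go.

```fortran
program split_jobs

  use mpi

  implicit none

  ! Hypothetical value: number of MPI ranks assigned to each job.
  integer, parameter :: ranks_per_job = 2
  integer :: ierr, global_rank, job_nr, job_comm

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, global_rank, ierr)

  ! Integer division of non-negative integers truncates, so it acts as floor().
  job_nr = global_rank / ranks_per_job

  ! Ranks that pass the same color (job_nr) land in the same new communicator.
  ! Using global_rank as the key keeps their relative ordering.
  call MPI_Comm_split(MPI_COMM_WORLD, job_nr, global_rank, job_comm, ierr)

  ! Job numbers run from 1 upward to match the loop index in the question.
  call mpi_job(job_nr + 1, job_comm)

  call MPI_Comm_free(job_comm, ierr)
  call MPI_Finalize(ierr)

contains

  subroutine mpi_job(n, comm)  ! assumed signature: job number plus communicator

    integer, intent(in) :: n, comm
    integer :: job_rank, job_size, ierr

    ! All communication inside the job uses comm and ranks local to comm.
    call MPI_Comm_rank(comm, job_rank, ierr)
    call MPI_Comm_size(comm, job_size, ierr)

    write(*,*) 'job ', n, ': local rank ', job_rank, ' of ', job_size

  end subroutine mpi_job

end program split_jobs
```

Launched with 24 ranks (for example mpiexec -n 24 ./split_jobs), this would form 12 groups of 2 ranks each. If the total number of ranks is not a multiple of ranks_per_job, the last job simply gets fewer ranks.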

Answer 3

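There is some confusion here over the basics of what you are trying to do. Your skeleton code will not run 12 MPI jobs at the same time; each MPI process that you create will run the 12 jobs sequentially.

What you want to do is run 12 MPI processes, each of which calls mpi_job once. Within mpi_job, you can then create 2 threads using OpenMP.

Process and thread placement is outside the scope of the MPI and OpenMP standards. For example, ensuring that the processes are spread evenly across your multicore machine (e.g. on each of the 12 even-numbered cores 0, 2, ... out of 24) and that the OpenMP threads run on even/odd pairs of cores would require you to look up the man pages for your MPI and OpenMP implementations. You may be able to place processes using arguments to mpiexec; thread placement may be controlled by environment variables, e.g. KMP_AFFINITY for Intel OpenMP.

Placement aside, here is some code that I think does what you want (I make no comment on whether it is the most efficient thing to do). I am using the GNU compilers here.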

user@laptop$ mpif90 -fopenmp -o basis basis.f90
user@laptop$ export OMP_NUM_THREADS=2
user@laptop$ mpiexec -n 12 ./basis
 Running           12  MPI jobs at the same time
 MPI job            2 , thread no.            1  reporting for duty
 MPI job           11 , thread no.            1  reporting for duty
 MPI job           11 , thread no.            0  reporting for duty
 MPI job            8 , thread no.            0  reporting for duty
 MPI job            0 , thread no.            1  reporting for duty
 MPI job            0 , thread no.            0  reporting for duty
 MPI job            2 , thread no.            0  reporting for duty
 MPI job            8 , thread no.            1  reporting for duty
 MPI job            4 , thread no.            1  reporting for duty
 MPI job            4 , thread no.            0  reporting for duty
 MPI job           10 , thread no.            1  reporting for duty
 MPI job           10 , thread no.            0  reporting for duty
 MPI job            3 , thread no.            1  reporting for duty
 MPI job            3 , thread no.            0  reporting for duty
 MPI job            1 , thread no.            0  reporting for duty
 MPI job            1 , thread no.            1  reporting for duty
 MPI job            5 , thread no.            0  reporting for duty
 MPI job            5 , thread no.            1  reporting for duty
 MPI job            9 , thread no.            1  reporting for duty
 MPI job            9 , thread no.            0  reporting for duty
 MPI job            7 , thread no.            0  reporting for duty
 MPI job            7 , thread no.            1  reporting for duty
 MPI job            6 , thread no.            1  reporting for duty
 MPI job            6 , thread no.            0  reporting for duty

Here is the code:

program basis

  use mpi

  implicit none

  integer :: ierr, size, rank
  integer :: comm = MPI_COMM_WORLD

  call MPI_Init(ierr)

  call MPI_Comm_size(comm, size, ierr)
  call MPI_Comm_rank(comm, rank, ierr)

  if (rank == 0) then
     write(*,*) 'Running ', size, ' MPI jobs at the same time'
  end if

  call mpi_job(rank)

  call MPI_Finalize(ierr)

contains

  subroutine mpi_job(n)  !this is mpi subroutine

    use omp_lib

    implicit none

    integer :: n, ithread

    !$omp parallel default(none) private(ithread) shared(n)

    ithread = omp_get_thread_num()

    write(*,*) 'MPI job ', n, ', thread no. ', ithread, ' reporting for duty'

    !$omp end parallel

  end subroutine mpi_job

end program basis

Running an MPI subroutine in parallel
