Question

I am parallelizing a Fortran code that runs without problems in its non-MPI version. Below is an excerpt of the code.

Each processor does the following:

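For a certain number of particles, it evolves certain quantities in the loop "do 203". In a given interval divided into Nint subintervals (j = 1, Nint), every processor produces the elements of the vectors Nx1(j), Nx2(j).
Then the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j = 1, Nint) sums the contributions from all processors: Nx1(j) from processor 1 + Nx1(j) from processor 2 + .... The root sums the values for every j (every subinterval) and produces Nx5(j), Nx6(j).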

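Another issue is that if I deallocate the variables, the code stays in standby after the calculation ends instead of completing execution; but I don't know whether this is related to the MPI_Allreduce issue.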

    include "mpif.h"
    ...
    integer*4 ....
    ...
    real*8 
    ...
    call MPI_INIT(mpierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

!       Allocate variables
    allocate(Nx1(Nint),Nx5(Nint))
    ...

!       Parameters
    ...

    call MPI_Barrier (MPI_COMM_WORLD, mpierr)

!   Loop on particles

    do 100 npartj=1,npart_local

     call init_random_seed() 
     call random_number (rand)

    ...
    Initial condition
    ... 
    do 203 i=1,1000000  ! loop for time evolution of single particle

        if(ufinp.gt.p1.and.ufinp.le.p2)then 
         do j=1,Nint  ! spatial position at any momentum
          ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
          rs(j) = ls(j)+Delta/Nint
          if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
            Nx1(j)=Nx1(j)+1 
          endif 
         enddo
        endif
       if(ufinp.gt.p2.and.ufinp.le.p3)then 
        do j=1,Nint  ! spatial position at any momentum
          ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
          rs(j) = ls(j)+Delta/Nint
          if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
            Nx2(j)=Nx2(j)+1 
          endif 
        enddo
       endif
203  continue 
100    continue     
    call MPI_Barrier (MPI_COMM_WORLD, mpierr)

    print*,"To be summed"
    do j=1,Nint
       call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum,
     &      MPI_COMM_WORLD, mpierr)
           call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum,
     &          MPI_COMM_WORLD, mpierr)
     enddo 

    if(mype.eq.0)then
     do j=1,Nint
       write(1,107)ls(j),Nx5(j),Nx6(j)
     enddo 
107  format(3(F13.2,2X,i6,2X,i6))   
    endif 
    call MPI_Barrier (MPI_COMM_WORLD, mpierr)
    print*,"Now deallocate"
!   deallocate(Nx1)  !inserting the de-allocate
!   deallocate(Nx2)

    close(1)

    call MPI_Finalize(mpierr)

    end


!  Subroutines
    ...

Answer 1

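"Then the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j = 1, Nint) sums the contributions from all processors: Nx1(j) from processor 1 + Nx1(j) from processor 2 + .... The root sums the values for every j (every subinterval) and produces Nx5(j), Nx6(j)."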

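That is not what an allreduce does. A reduction means the summation is carried out collectively across all processes. The "all" in allreduce means that every process receives the result of the summation, not only the root.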
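
If only the root really needs the totals, as the quoted description suggests, MPI_REDUCE is the collective that matches it. The following is a minimal sketch rather than a drop-in replacement: `nbins`, `local` and `total` are hypothetical names standing in for Nint, Nx1 and Nx5, the per-rank data is dummy, and `use mpi` is assumed to be available in place of the `include "mpif.h"` used in the question.

    program reduce_to_root
      use mpi
      implicit none
      integer, parameter :: nbins = 4     ! hypothetical stand-in for Nint
      integer :: local(nbins), total(nbins)
      integer :: mype, npe, mpierr

      call MPI_INIT(mpierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

      local = mype + 1    ! dummy per-rank counts
      total = 0           ! only meaningful on the root after the call

      ! Element-wise sum over all ranks; the result is delivered to rank 0 only.
      call MPI_REDUCE(local, total, nbins, MPI_INTEGER, MPI_SUM, 0, &
                      MPI_COMM_WORLD, mpierr)

      if (mype == 0) print *, "totals on root:", total

      call MPI_FINALIZE(mpierr)
    end program reduce_to_root

With MPI_ALLREDUCE the same sums would end up in `total` on every rank, which costs a little more communication but is harmless if only rank 0 writes the output.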

Your MPI_Allreduce calls:

   call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)
   call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)

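In these calls the count should actually be 1. The count argument says how many elements each process contributes in that call (the length of the send buffer), not how many elements exist in total across processes. With count = npe, each call reads npe elements starting at Nx1(j) and writes npe elements starting at Nx5(j), which runs past the end of the arrays once j + npe - 1 exceeds Nint.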
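
To make the meaning of count concrete, here is a small self-contained sketch of the per-element loop with count = 1. It is an illustration under assumptions, not the original program: `nbins` is a hypothetical stand-in for Nint, the contents of Nx1 are dummy values chosen so the expected sum is easy to check, and `use mpi` is assumed instead of `include "mpif.h"`.

    program count_is_per_call
      use mpi
      implicit none
      integer, parameter :: nbins = 4     ! hypothetical stand-in for Nint
      integer :: Nx1(nbins), Nx5(nbins)
      integer :: j, mype, npe, mpierr

      call MPI_INIT(mpierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

      Nx1 = mype + 1    ! dummy counts: rank r contributes r+1 to every bin
      Nx5 = 0

      do j = 1, nbins
         ! count = 1: one integer is read from Nx1(j), one is written to Nx5(j)
         call MPI_ALLREDUCE(Nx1(j), Nx5(j), 1, MPI_INTEGER, MPI_SUM, &
                            MPI_COMM_WORLD, mpierr)
      end do

      ! Every rank should now hold 1 + 2 + ... + npe in each bin.
      if (any(Nx5 /= npe*(npe + 1)/2)) print *, "rank", mype, ": unexpected sum"
      if (mype == 0) print *, "Nx5 =", Nx5

      call MPI_FINALIZE(mpierr)
    end program count_is_per_call

An out-of-bounds write of the kind caused by count = npe can corrupt memory, so it might also be connected to the hang on deallocate, although the excerpt in the question is not enough to confirm that.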

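However, you don't actually need that loop, because allreduce can handle multiple elements at once. So instead of looping over allreduce calls, I believe what you actually want is something like this: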

   integer :: Nx1(nint)
   integer :: Nx2(nint)
   integer :: Nx5(nint)
   integer :: Nx6(nint)

   call MPI_ALLREDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)
   call MPI_ALLREDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)

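Nx5 will then contain the sum of Nx1 over all processes, and Nx6 the sum of Nx2. The information in your question is a bit sparse, so I'm not entirely sure this is what you are looking for.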
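
For completeness, the whole-array variant can be sketched end to end with allocatable arrays, including the deallocate step before MPI_Finalize. As before, this is an assumption-laden sketch: `nbins` is a hypothetical stand-in for Nint, the counts are dummy data, only one of the two histograms is shown, and `use mpi` is assumed.

    program allreduce_arrays
      use mpi
      implicit none
      integer, parameter :: nbins = 4     ! hypothetical stand-in for Nint
      integer, allocatable :: Nx1(:), Nx5(:)
      integer :: j, mype, npe, mpierr

      call MPI_INIT(mpierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

      allocate(Nx1(nbins), Nx5(nbins))
      Nx1 = 0           ! counters must start from zero on every rank
      Nx5 = 0

      ! Dummy data: rank r puts r+1 counts into bin mod(r, nbins) + 1.
      Nx1(mod(mype, nbins) + 1) = mype + 1

      ! One call sums all nbins elements across ranks; every rank gets Nx5.
      call MPI_ALLREDUCE(Nx1, Nx5, nbins, MPI_INTEGER, MPI_SUM, &
                         MPI_COMM_WORLD, mpierr)

      if (mype == 0) then
         do j = 1, nbins
            print '(i4,2x,i6)', j, Nx5(j)
         end do
      end if

      deallocate(Nx1, Nx5)
      call MPI_FINALIZE(mpierr)
    end program allreduce_arrays

With a typical MPI installation this would be built and run with something like `mpif90 allreduce_arrays.f90` followed by `mpirun -np 4 ./a.out`; the wrapper and launcher names vary between installations.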

MPI_Allreduce sum
