Reordering a vector in parallel for fast performance

Date: 2016-10-03 15:07:20

Tags: vector parallel-processing fortran fortran90

I have a vector whose length can reach several million elements or more.

Say the vector is vec = [a1 a2 ... b1 b2 ... c1 c2 ... d1 d2 ...]

I need to rearrange vec into new_vec = [a1 b1 c1 d1 a2 b2 c2 d2 ...]

If the sub-vectors are viewed as the columns of a matrix, the operation is essentially a transpose, but I do not have a 2D array. I know how to do this on a sequential machine; it is straightforward.
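
For concreteness, here is a minimal sketch of that sequential "transpose" view using reshape; the names blklen and nblk are hypothetical, introduced only for this illustration and not taken from my actual code:

program transpose_view
  implicit none
  integer, parameter :: blklen = 3, nblk = 4            ! 4 sub-vectors of length 3
  real(kind = 8) :: vec(blklen*nblk), new_vec(blklen*nblk)
  integer :: i

  vec = (/ (real(i, kind = 8), i = 1, blklen*nblk) /)   ! [a1 a2 a3 b1 b2 b3 ...]

  ! View vec as a blklen-by-nblk matrix (each column is one sub-vector),
  ! transpose it, and flatten back in column-major order:
  ! new_vec = [a1 b1 c1 d1 a2 b2 c2 d2 ...]
  new_vec = reshape(transpose(reshape(vec, (/ blklen, nblk /))), (/ blklen*nblk /))

  print *, new_vec
end program transpose_view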

But I do not know how to do this on a multiprocessor cluster or on a GPU, or even whether it is feasible at all on a parallel machine. Memory seems to be the obvious bottleneck. If there is any algorithm or any architecture-specific optimization I could use, please let me know.

Edit: more information is given below.

The structure of the code is:

subroutine reorder(vec,parameter)

implicit none
type(param),                                 intent(in)    :: parameter !just a struct holding certain constant parameters
real(kind = 8), dimension(parameter%length), intent(inout) :: vec
real(kind = 8), dimension(parameter%length)                :: temp
integer :: i, j, k, q1, q2, q3, nn1, nn2, n1, n2, i1, i2, i3

i1 = parameter%len1    !lengths of sub-vectors in each direction
i2 = parameter%len2    !the multiplication of the 3 gives the
i3 = parameter%len3    !overall length of vec

temp = vec             !work on a copy so no element is overwritten before it is read

n1 = i2*i1
n2 = i2*i3
do k = 1, i3
   q1 = n1*(k-1)
   q2 = i2*(k-1)
   do j = 1, i2
      q3 = i1*(j-1)
      do i = 1, i1
         nn1 = q1+q3+i            !source index (blocked layout)
         nn2 = q2+j+n2*(i-1)      !destination index (interleaved layout)
         vec(nn2) = temp(nn1)
      end do
   end do
end do

end subroutine reorder

So the purpose of the code is to reorder the elements of the vector according to this rule. As you can see, the vector can get very long, and a lot of time is spent in this routine.
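
For reference, here is a hedged sketch of a shared-memory OpenMP variant of the same loop nest; the name reorder_omp and the interface that passes the sub-vector lengths directly are assumptions made only for this illustration, not part of my actual code. It relies on the fact that every (i, j, k) triple writes a distinct element of vec and only reads temp, so the iterations are independent:

subroutine reorder_omp(vec, i1, i2, i3)
  implicit none
  integer,                             intent(in)    :: i1, i2, i3
  real(kind = 8), dimension(i1*i2*i3), intent(inout) :: vec
  real(kind = 8), dimension(i1*i2*i3)                :: temp
  integer :: i, j, k, q1, q2, q3, nn1, nn2, n1, n2

  temp = vec
  n1 = i2*i1
  n2 = i2*i3

  ! Each iteration writes a unique vec(nn2) and only reads temp, so the
  ! outer two loops can be collapsed and distributed over threads.
  !$omp parallel do collapse(2) private(i, j, k, q1, q2, q3, nn1, nn2)
  do k = 1, i3
     do j = 1, i2
        q1 = n1*(k-1)
        q2 = i2*(k-1)
        q3 = i1*(j-1)
        do i = 1, i1
           nn1 = q1 + q3 + i
           nn2 = q2 + j + n2*(i-1)
           vec(nn2) = temp(nn1)
        end do
     end do
  end do
  !$omp end parallel do
end subroutine reorder_omp

With OpenMP enabled at compile time (e.g. -fopenmp for gfortran), this distributes the copy loop over the threads of a single rank; it does not reduce the amount of data that has to move through memory.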

This routine is called many times from the main routine. A Cartesian decomposition at the start produces a Cartesian 3D arrangement, and each rank calls the subroutine whenever the elements need to be reordered for the next subroutine call. The Cartesian communicator subroutine is shown in the skeleton below:

subroutine cartesian_comm(ndim,comm_cart,comm_one_d,coord_cart)
use mpi
implicit none
integer, dimension(:), intent(in)  :: ndim
integer,               intent(out) :: comm_cart
integer, dimension(:), pointer     :: comm_one_d, coord_cart
logical, dimension(size(ndim))     :: period, remain
integer :: dim,code, i, rank

!creating the cartesian communicator
dim = 3
allocate(comm_one_d(dim),coord_cart(dim))
period   = .FALSE.
call MPI_CART_CREATE(MPI_COMM_WORLD, dim, ndim, period, .FALSE., comm_cart, code)
call MPI_COMM_RANK(comm_cart, rank, code)
call MPI_CART_COORDS(comm_cart, rank, dim, coord_cart, code)

!Creating sub-communicator for each direction
do i = 1, dim
   remain = .FALSE.
   remain(i) = .TRUE.
   call MPI_CART_SUB(comm_cart, remain, comm_one_d(i), code)
end do
end subroutine cartesian_comm

It is called in the main program as follows:

Program main
!initialize some stuff and initialize all the required variables

! ndim is derived from the number of processes the program is
! launched with: "mpirun -np 8 ./exec" means ndim is the cube root
! of 8, and therefore 2 in each direction for the 3D case. It is
! always made sure that np is a cube (or a square for 2D) when
! calling the program

call cartesian_comm(ndim,comm_cart,comm_one_d,coord_cart)

  do while (t<tend-1D-8)  !start time loop
    t = t + dt
    !do some computations get the vector "vec" for 
    !each rank separately (different and independent in each rank)

    call reorder(vec,parameter) ! all ranks call this subroutine

    !do some computations here with the new reordered vec

  end do !end time loop

!do other stuff (unrelated to reorder but use the "vec" vector) here

end Program main

I would like to know whether there is a better way to do this on a multiprocessor cluster, or how one would go about it in the case of a GPU.

0 Answers:

There are no answers yet.