Question

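I have a Fortran code that I am compiling with f2py so that I can run it from Python. I wrote it as a faster replacement for an existing Python code, but it actually runs slower than its Python counterpart, which makes me think it is not optimized. (This may be related to this question, although the culprit in that case was something I am not using here, so it does not apply to me.)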

The code takes a 4D matrix as input and performs a 2D correlation over dimensions 3 and 4 (as if they were x and y).

The code is this:

SUBROUTINE correlate4D(Vars, nxc, nyc, nv, nt, vCorr)
! Correlates 4D array assuming that x, y are dims 3 and 4
! Returns a 4D array with shape nv, nt, 2*nxc+1, 2*nyc+1
IMPLICIT NONE
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d

dims = shape(Vars)
nx=dims(3)
ny=dims(4)
dummysize=nx*ny
allocate(rolled(nv, nt, nx, ny))
allocate(prerolled(nv, nt, nx, ny))
allocate(Mean3d(nv, nt, nx))
allocate(Mean(nv, nt))

Mean3d = sum(Vars, dim=4)/size(Vars, dim=4)
Mean = sum(Mean3d, dim=3)/size(Mean3d, dim=3)

! These replace np.arange()
ii=1
do i=-nxc,+nxc
    xdel(ii)=i
    ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
    ydel(ii)=i
    ii=ii+1
enddo

! Calculate the correlation
do iiy=1,size(ydel)
    print*,'fortran loop:',iiy,' of', size(ydel)
    ! cshift replaces np.roll()
    prerolled = cshift(Vars, ydel(iiy), dim=4)
    do iix=1,size(xdel)
        rolled = cshift(prerolled, xdel(iix), dim=3)
        forall (it=1:nt)
            forall (iv=1:nv)
                vCorr(iv,it,iix,iiy) = (sum(Vars(iv,it,:,:) * rolled(iv,it,:,:))/dummysize) / (Mean(iv,it)**2)
            endforall
        endforall
    enddo
enddo

END SUBROUTINE

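Running this code, compiled with f2py, on a matrix of size (3, 50, 100, 100) takes 251 seconds, whereas the pure Python/NumPy code takes only 103 seconds. By the way, this is roughly the typical size of the matrices: the input should be something like (3, 300, 100, 100), but no larger than that.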
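For reference, the computation performed by correlate4D can be sketched in NumPy as below. This is only an illustration of what the Fortran routine computes, not the Python code that was actually timed (which is not shown here); the function name correlate4d_numpy is hypothetical.

```python
import numpy as np

def correlate4d_numpy(vars4d, nxc, nyc):
    """Circular 2D autocorrelation over axes 2 and 3, normalised by mean**2.

    vars4d has shape (nv, nt, nx, ny); the result has shape
    (nv, nt, 2*nxc+1, 2*nyc+1), like vCorr in the Fortran routine.
    """
    nv, nt, nx, ny = vars4d.shape
    mean = vars4d.mean(axis=(2, 3))  # shape (nv, nt)
    vcorr = np.empty((nv, nt, 2 * nxc + 1, 2 * nyc + 1))
    for iiy, dy in enumerate(range(-nyc, nyc + 1)):
        # Fortran cshift(a, s) corresponds to np.roll(a, -s)
        prerolled = np.roll(vars4d, -dy, axis=3)
        for iix, dx in enumerate(range(-nxc, nxc + 1)):
            rolled = np.roll(prerolled, -dx, axis=2)
            vcorr[:, :, iix, iiy] = (vars4d * rolled).mean(axis=(2, 3)) / mean**2
    return vcorr
```

Each output element is the mean of the product of the field with a circularly shifted copy of itself, divided by the squared mean, which is what the forall block in the Fortran code evaluates.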

Can anyone point out ways I could optimize this code?

EDIT

I am compiling with f2py3.4 -c mwe.f90 -m mwe, after which the routine can be called with:

In [1]: import mwe
In [2]: import numpy as np
In [3]: a=np.random.randn(3,15,100,100)
In [4]: mwe.correlate4d(a, 50, 50, 3, 15)

EDIT2

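After reading the comments, I was able to improve it by changing the order of the indices. It is now about 10% faster than Python, but it is still too slow. I am sure this can be done faster.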

SUBROUTINE correlate4D2(Vars, nxc, nyc, nt, nv, vCorr)
! Correlates 4D array assuming that x, y are dims 1 and 2
! Returns a 4D array with shape 2*nxc+1, 2*nyc+1, nt, nv
IMPLICIT NONE
INTEGER, PARAMETER  ::  dp = SELECTED_REAL_KIND (13)
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
!real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), intent(out) :: vCorr(2*nxc+1, 2*nyc+1, nt, nv)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d

dims = shape(Vars)
nx=dims(1)
ny=dims(2)
dummysize=nx*ny
allocate(rolled(nx, ny, nt, nv))
allocate(prerolled(nx, ny, nt, nv))
allocate(Mean3d(ny, nt, nv))
allocate(Mean(nt, nv))

Mean3d = sum(Vars, dim=1)/size(Vars, dim=1)
Mean = sum(Mean3d, dim=1)/size(Mean3d, dim=1)

ii=1
do i=-nxc,+nxc
    xdel(ii)=i
    ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
    ydel(ii)=i
    ii=ii+1
enddo

do iiy=1,size(ydel)
    print*,'fortran loop:',iiy,' of', size(ydel)
    prerolled = cshift(Vars, ydel(iiy), dim=2)
    do iix=1,size(xdel)
        rolled = cshift(prerolled, xdel(iix), dim=1)
        forall (iv=1:nv)
            forall (it=1:nt)
                vCorr(iix,iiy,it,iv) = (sum(Vars(:,:,it,iv) * rolled(:,:,it,iv))/dummysize) / (Mean(it,iv)**2)
            endforall
        endforall
    enddo
enddo

END SUBROUTINE

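Also, even though the code has a dp parameter (which returns 8, as it should), if I declare the variables with real(dp), f2py throws this error: Parameter 'dp' at (1) has not been declared or is a variable, even though it is declared. That is why I am using 8 directly.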

Answer 1

Note: a rather long and boring answer...

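Because repeated use of cshift() on large matrices seems expensive, I tried some modifications around cshift. To do so, I first created a minimal version of the OP's code: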

program main
    implicit none
    integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
    real(dp), allocatable, dimension(:,:,:) :: A, Ashift_y, Ashift, B
    integer :: sx, sy, i, t

    allocate( A( N, N, nt ), Ashift_y( N, N, nt ), Ashift( N, N, nt ), &
              B( -N:N-1, -N:N-1, nt ) )
    call initA

    do sy = -N, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy

        Ashift_y = cshift( A, sy, dim=2 )

        do sx = -N, N-1
            Ashift = cshift( Ashift_y, sx, dim=1 )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
            enddo
        enddo
    enddo

    print *, "sum(B) = ", sum(B)
    print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )

contains
    subroutine initA
        integer ix, iy
        forall( t = 1:nt, iy = 1:N, ix = 1:N )  &   ! (forall not recommended but for brevity)
                A( ix, iy, t ) = 1.0_dp / ( mod(ix + iy + t, 100) + 1 )
    endsubroutine    
endprogram

which gives

sum(B) =    53817771.021093562     
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575 

Mac mini (2.3GHz,4-core), gfortran-6.3 -O3 : 50 sec
Linux (2.6GHz,16-core),   gfortran-4.8 -O3 : 32 sec

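Next, because cshift(A,s,dim=1 (or 2)) is periodic with respect to the shift s (with period N), the loops over sx and sy can each be split into two halves, giving four quadrants, and only the first quadrant needs to be computed (i.e., sx and sy in [0, N-1]). The data for the other quadrants are obtained by simply copying the first-quadrant data. This reduces the CPU time by a factor of 4. (More simply, we could compute sx and sy only in [-N/2, N/2], because B in the other regions provides no new information.)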
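The periodicity argument can be checked numerically with a small NumPy sketch. The array size N = 8, the random data, and the helper name corr are arbitrary choices for illustration only; they are not taken from the timing runs.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
A = rng.random((N, N))

def corr(sx, sy):
    # Circular correlation of A with itself at lag (sx, sy)
    return np.sum(A * np.roll(A, (-sy, -sx), axis=(0, 1)))

# First quadrant only: N*N lags instead of (2N)*(2N)
B_quadrant = np.array([[corr(sx, sy) for sx in range(N)] for sy in range(N)])

# Tile the quadrant to cover lags -N..N-1 in both directions,
# mirroring the block copies used to build the "full" B
B_full = np.tile(B_quadrant, (2, 2))

# Direct computation over the full lag range, for comparison
B_direct = np.array([[corr(sx, sy) for sx in range(-N, N)]
                     for sy in range(-N, N)])

print(np.allclose(B_full, B_direct))
```

Because the full array is four copies of the first quadrant, its sum is exactly four times the quadrant sum, which matches the two sums reported for the Fortran runs.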

    do sy = 0, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
        Ashift_y = cshift( A, sy, dim=2 )

        do sx = 0, N-1
            Ashift = cshift( Ashift_y, sx, dim=1 )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
            enddo
        enddo
    enddo

    print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )

    !! Make "full" B.
    B( -N :  -1,  0 : N-1, : ) = B( 0 : N-1, 0 : N-1, : )
    B(  0 : N-1, -N :  -1, : ) = B( 0 : N-1, 0 : N-1, : )
    B( -N :  -1, -N :  -1, : ) = B( 0 : N-1, 0 : N-1, : )
    print *, "sum(B) = ", sum(B)

As expected, the result agrees with the full calculation:

sum(B) =    53817771.021093562     
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     

Mac   : 12.8 sec
Linux :  8.3 sec

The corresponding Python code might look like this:

from __future__ import print_function, division
import numpy as np

N, nt = 100, 50

A = np.zeros( (nt, N, N) )
B = np.zeros( (nt, N, N) )

for t in range(nt):
    for iy in range(N):
        for ix in range(N):
            A[ t, iy, ix ] = 1.0 / ( (ix + iy + t) % 100 + 1 )

for sy in range( N ):
    if sy % (N // 10) == 0 : print( "sy = ", sy )
    Ashift_y = np.roll( A, -sy, axis=1 )

    for sx in range( N ):
        Ashift = np.roll( Ashift_y, -sx, axis=2 )

        for t in range( nt ):
            B[ t, sy, sx ] = np.sum( A[ t, :, : ] * Ashift[ t, :, : ] )

print( "sum( B( :, 0:N-1, 0:N-1 ) ) = ",  np.sum( B ) )

which runs in 22-24 seconds on both Mac and Linux (python3.5).

To reduce the cost further, we use the fact that cshift can be used in two equivalent ways:

cshift( array, s ) == array( cshift( [1,2,...,n], s ) )   !! assuming that "array" is declared as a( n )

We can then rewrite the code above so that cshift() receives only ind = [1,2,...,N]:

    integer, dimension(N) :: ind, indx, indy
    ind = [( i, i=1,N )]

    do sy = 0, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
        indy = cshift( ind, sy )

        do sx = 0, N-1
            indx = cshift( ind, sx )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
            enddo
        enddo
    enddo

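This runs in about 5 seconds on both Mac and Linux. A similar approach may also work in Python. (I also tried using mod() explicitly on the indices to eliminate cshift entirely, but, somewhat surprisingly, that was about twice as slow as the code above...)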
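A minimal NumPy sketch of the same index-shifting idea is given below, with np.roll applied to an index vector instead of the data. The sizes are deliberately small and arbitrary, and whether this is actually faster than rolling the data in NumPy has not been measured here, since fancy indexing still copies the gathered elements.

```python
import numpy as np

N, nt = 16, 3
rng = np.random.default_rng(0)
A = rng.random((nt, N, N))
ind = np.arange(N)

# 1-D identity: rolling the data equals indexing with rolled indices
a = rng.random(N)
assert np.array_equal(np.roll(a, -3), a[np.roll(ind, -3)])

B = np.empty((nt, N, N))
for sy in range(N):
    indy = np.roll(ind, -sy)
    for sx in range(N):
        indx = np.roll(ind, -sx)
        # shifted[t, i, j] == A[t, indy[i], indx[j]]
        shifted = A[:, indy[:, None], indx]
        B[:, sy, sx] = np.sum(A * shifted, axis=(1, 2))
```

Here indy[:, None] and indx broadcast to an (N, N) grid of row and column indices, so a single indexing expression gathers the doubly shifted array for all t at once.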

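Even with this reduction, the code becomes slow for large nt (for example 300, as in the question). In that case we can use the final weapon and parallelize the loop over sy: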

program main
    implicit none
    integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
!    integer, parameter :: N = 100, nt = 300, dp = kind(0.0d0)
    real(dp), allocatable, dimension(:,:,:) :: A, B
    integer, dimension(N) :: ind, indx, indy
    integer :: sx, sy, i, t

    allocate( A( N, N, nt ), B( -N:N-1, -N:N-1, nt ) )
    call initA
    ind = [( i, i=1,N )]

    !$omp parallel do private(sx,sy,t,indx,indy)
    do sy = 0, N-1
        if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
        indy = cshift( ind, sy )

        do sx = 0, N-1
            indx = cshift( ind, sx )

            do t = 1, nt
                B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
            enddo
        enddo
    enddo
    !$omp end parallel do

    print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )

    ! "contains subroutine initA ..." here

endprogram

The timing data look like this (using gfortran -O3 -fopenmp):

N = 100, nt = 50
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
Mac:
   serial : 5.3 sec
2 threads : 2.6 sec
4 threads : 1.4 sec

N = 100, nt = 50
sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
Linux:
    serial : 4.8 sec
 2 threads : 2.7 sec
 4 threads : 1.3 sec
 8 threads : 0.7 sec
16 threads : 0.4 sec
32 threads : 0.4 sec

N = 100, nt = 300   // heavy case
sum( B( 0:N-1, 0:N-1, : ) ) =    80726656.531429410     
Linux:
 2 threads: 16.5 sec
 4 threads:  8.4 sec
 8 threads:  4.4 sec
16 threads:  2.5 sec

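So, if the code above has no bugs (hopefully!), we can save a lot of CPU time by (1) restricting sx and sy to [0, N-1] (or, more simply, to [-N/2, N/2], with no further copying), (2) applying cshift to index arrays (rather than data arrays), and/or (3) parallelizing over sy (which can hopefully be combined with f2py...).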
