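I have a Fortran code that I am compiling with f2py to run in Python. I wrote it as a faster replacement for an existing Python code, but it actually runs slower than its Python counterpart, which makes me think it is not optimized. (This may be related to this question, although the culprit in that case is something I am not using here, so it does not apply to me.)
The code takes a 4D array as input and performs a 2D correlation over dimensions 3 and 4 (as if they were x and y).
The code is this: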
SUBROUTINE correlate4D(Vars, nxc, nyc, nv, nt, vCorr)
! Correlates 4D array assuming that x, y are dims 3 and 4
! Returns a 4D array with shape nv, nt, 2*nxc+1, 2*nyc+1
IMPLICIT NONE
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d
dims = shape(Vars)
nx=dims(3)
ny=dims(4)
dummysize=nx*ny
allocate(rolled(nv, nt, nx, ny))
allocate(prerolled(nv, nt, nx, ny))
allocate(Mean3d(nv, nt, nx))
allocate(Mean(nv, nt))
Mean3d = sum(Vars, dim=4)/size(Vars, dim=4)
Mean = sum(Mean3d, dim=3)/size(Mean3d, dim=3)
! These replace np.arange()
ii=1
do i=-nxc,+nxc
   xdel(ii)=i
   ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
   ydel(ii)=i
   ii=ii+1
enddo
! Calculate the correlation
do iiy=1,size(ydel)
   print*,'fortran loop:',iiy,' of', size(ydel)
   ! cshift replaces np.roll()
   prerolled = cshift(Vars, ydel(iiy), dim=4)
   do iix=1,size(xdel)
      rolled = cshift(prerolled, xdel(iix), dim=3)
      forall (it=1:nt)
         forall (iv=1:nv)
            vCorr(iv,it,iix,iiy) = (sum(Vars(iv,it,:,:) * rolled(iv,it,:,:))/dummysize) / (Mean(iv,it)**2)
         endforall
      endforall
   enddo
enddo
END SUBROUTINE
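Running this code, compiled with f2py, on an array of size (3, 50, 100, 100) takes 251 seconds, while the pure Python/numpy code takes only 103 seconds. By the way, this is roughly the typical size of the input: it should be something like (3, 300, 100, 100), but not larger than that.
Can anyone point out ways in which I can optimize this code?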
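For reference, the quantity that the subroutine computes can be written compactly in NumPy. This is only a sketch under stated assumptions, not the original Python code (which is not shown here): the function name `correlate4d_numpy` is hypothetical, and `np.roll` with a negated shift is assumed to play the role of Fortran's `cshift`.

```python
import numpy as np

def correlate4d_numpy(vars_, nxc, nyc):
    """Hypothetical NumPy counterpart of correlate4D (illustration only).

    vars_ has shape (nv, nt, nx, ny); the result has shape
    (nv, nt, 2*nxc+1, 2*nyc+1), as in the Fortran version.
    """
    nv, nt, nx, ny = vars_.shape
    mean = vars_.mean(axis=(2, 3))              # mean over x and y, shape (nv, nt)
    vcorr = np.empty((nv, nt, 2 * nxc + 1, 2 * nyc + 1))
    for iiy, dy in enumerate(range(-nyc, nyc + 1)):
        # cshift(a, s) moves elements towards lower indices, i.e. np.roll(a, -s)
        prerolled = np.roll(vars_, -dy, axis=3)
        for iix, dx in enumerate(range(-nxc, nxc + 1)):
            rolled = np.roll(prerolled, -dx, axis=2)
            # mean over x and y of the product, normalised by the squared mean
            vcorr[:, :, iix, iiy] = (vars_ * rolled).mean(axis=(2, 3)) / mean**2
    return vcorr
```

The entry at zero shift, `vcorr[:, :, nxc, nyc]`, is simply the mean of the squared field divided by the squared mean, which makes a convenient sanity check.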
EDIT
I am compiling with f2py3.4 -c mwe.f90 -m mwe, and the module can then be used like this:
In [1]: import mwe
In [2]: import numpy as np
In [3]: a=np.random.randn(3,15,100,100)
In [4]: mwe.correlate4d(a, 50, 50, 3, 15)
EDIT2
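After reading the comments, I was able to improve it by changing the order of the indices. It is now about 10% faster than Python, but it is still too slow. I am sure this can be done faster.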
SUBROUTINE correlate4D2(Vars, nxc, nyc, nt, nv, vCorr)
! Correlates 4D array assuming that x, y are dims 1 and 2
! Returns a 4D array with shape 2*nxc+1, 2*nyc+1, nt, nv
IMPLICIT NONE
INTEGER, PARAMETER :: dp = SELECTED_REAL_KIND (13)
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
!real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), intent(out) :: vCorr(2*nxc+1, 2*nyc+1, nt, nv)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d
dims = shape(Vars)
nx=dims(1)
ny=dims(2)
dummysize=nx*ny
allocate(rolled(nx, ny, nt, nv))
allocate(prerolled(nx, ny, nt, nv))
allocate(Mean3d(ny, nt, nv))
allocate(Mean(nt, nv))
Mean3d = sum(Vars, dim=1)/size(Vars, dim=1)
Mean = sum(Mean3d, dim=1)/size(Mean3d, dim=1)
ii=1
do i=-nxc,+nxc
   xdel(ii)=i
   ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
   ydel(ii)=i
   ii=ii+1
enddo
do iiy=1,size(ydel)
   print*,'fortran loop:',iiy,' of', size(ydel)
   prerolled = cshift(Vars, ydel(iiy), dim=2)
   do iix=1,size(xdel)
      rolled = cshift(prerolled, xdel(iix), dim=1)
      forall (iv=1:nv)
         forall (it=1:nt)
            vCorr(iix,iiy,it,iv) = (sum(Vars(:,:,it,iv) * rolled(:,:,it,iv))/dummysize) / (Mean(it,iv)**2)
         endforall
      endforall
   enddo
enddo
END SUBROUTINE
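Also, even though there is a dp parameter in the code (which returns 8, as it should), if I declare the variables with real(dp), f2py throws this error: Parameter 'dp' at (1) has not been declared or is a variable, even though it is declared. That is why I am using 8 directly.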
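Answer 0 (score: 2)
Note: a rather long and boring answer...
Because repeatedly applying cshift() to large arrays seems expensive, I tried some modifications around cshift. To do this, I first made a minimal version of the OP's code: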
program main
implicit none
integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
real(dp), allocatable, dimension(:,:,:) :: A, Ashift_y, Ashift, B
integer :: sx, sy, i, t
allocate( A( N, N, nt ), Ashift_y( N, N, nt ), Ashift( N, N, nt ), &
B( -N:N-1, -N:N-1, nt ) )
call initA
do sy = -N, N-1
   if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
   Ashift_y = cshift( A, sy, dim=2 )
   do sx = -N, N-1
      Ashift = cshift( Ashift_y, sx, dim=1 )
      do t = 1, nt
         B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
      enddo
   enddo
enddo
print *, "sum(B) = ", sum(B)
print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )
contains
subroutine initA
   integer ix, iy
   forall( t = 1:nt, iy = 1:N, ix = 1:N ) & ! (forall not recommended but for brevity)
      A( ix, iy, t ) = 1.0_dp / ( mod(ix + iy + t, 100) + 1 )
endsubroutine
endprogram
This gives
sum(B) = 53817771.021093562
sum( B( 0:N-1, 0:N-1, : ) ) = 13454442.755258575
Mac mini (2.3GHz,4-core), gfortran-6.3 -O3 : 50 sec
Linux (2.6GHz,16-core), gfortran-4.8 -O3 : 32 sec
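Next, because cshift(A, s, dim=1 (or 2)) is periodic with respect to the shift s (with period N), the loops over sx and sy can be split into four parts, of which only the first quadrant needs to be computed (i.e., sx and sy in [0, N-1]). The data for the other quadrants are obtained by simply copying the first-quadrant data. This reduces the CPU time by a factor of 4. (More simply, we could compute sx and sy only in [-N/2, N/2], because B in the other regions provides no new information.)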
do sy = 0, N-1
   if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
   Ashift_y = cshift( A, sy, dim=2 )
   do sx = 0, N-1
      Ashift = cshift( Ashift_y, sx, dim=1 )
      do t = 1, nt
         B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
      enddo
   enddo
enddo
print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )
!! Make "full" B.
B( -N : -1, 0 : N-1, : ) = B( 0 : N-1, 0 : N-1, : )
B( 0 : N-1, -N : -1, : ) = B( 0 : N-1, 0 : N-1, : )
B( -N : -1, -N : -1, : ) = B( 0 : N-1, 0 : N-1, : )
print *, "sum(B) = ", sum(B)
The result agrees with the full calculation, as expected:
sum(B) = 53817771.021093562
sum( B( 0:N-1, 0:N-1, : ) ) = 13454442.755258575
Mac : 12.8 sec
Linux : 8.3 sec
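The periodicity argument can be checked on a small array. The following NumPy sketch (an illustration with an arbitrary N = 6 and random data, not part of the timed code) computes the correlation for all shifts in [-N, N-1] and compares it with the first quadrant tiled 2 x 2:

```python
import numpy as np

N = 6
rng = np.random.default_rng(0)
A = rng.random((N, N))

def corr(sy, sx):
    """Sum of A times A circularly shifted by (sy, sx)."""
    return np.sum(A * np.roll(A, (-sy, -sx), axis=(0, 1)))

# Full range of shifts, -N .. N-1 in both directions.
shifts = range(-N, N)
B_full = np.array([[corr(sy, sx) for sx in shifts] for sy in shifts])

# First quadrant only (0 .. N-1), then tile it to fill the other three.
B_quad = np.array([[corr(sy, sx) for sx in range(N)] for sy in range(N)])
B_tiled = np.tile(B_quad, (2, 2))

print(np.allclose(B_full, B_tiled))
print(np.isclose(B_full.sum(), 4 * B_quad.sum()))
```

The second check mirrors the Fortran output, where sum(B) is four times the sum over the first quadrant.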
The corresponding Python code (for the first-quadrant loop) might look like this:
from __future__ import print_function, division
import numpy as np
N, nt = 100, 50
A = np.zeros( (nt, N, N) )
B = np.zeros( (nt, N, N) )
for t in range(nt):
    for iy in range(N):
        for ix in range(N):
            A[ t, iy, ix ] = 1.0 / ( (ix + iy + t) % 100 + 1 )
for sy in range( N ):
    if sy % (N // 10) == 0 : print( "sy = ", sy )
    Ashift_y = np.roll( A, -sy, axis=1 )
    for sx in range( N ):
        Ashift = np.roll( Ashift_y, -sx, axis=2 )
        for t in range( nt ):
            B[ t, sy, sx ] = np.sum( A[ t, :, : ] * Ashift[ t, :, : ] )
print( "sum( B( :, 0:N-1, 0:N-1 ) ) = ", np.sum( B ) )
This runs in 22-24 seconds on both Mac and Linux (python3.5).
To reduce the cost further, we exploit the fact that cshift can be used in two equivalent ways:
cshift( array, s ) == array( cshift( [1,2,...,n], s ) ) !! assuming that "array" is declared as a( n )
Then we can rewrite the above code so that cshift() receives only the index array ind = [1,2,...,N]:
integer, dimension(N) :: ind, indx, indy
ind = [( i, i=1,N )]
do sy = 0, N-1
   if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
   indy = cshift( ind, sy )
   do sx = 0, N-1
      indx = cshift( ind, sx )
      do t = 1, nt
         B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
      enddo
   enddo
enddo
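This runs in about 5 seconds on both Mac and Linux. A similar approach may also be applicable to Python. (I also tried using mod() explicitly for the indices to eliminate cshift entirely, but, somewhat surprisingly, that was about twice as slow as the above code...)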
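As a rough illustration of how the index-array idea might carry over to NumPy, the sketch below shifts only the index vectors and then gathers from A with integer-array indexing. The function names are hypothetical, the array is small and random, and no timing is claimed: NumPy's integer-array indexing makes a copy, so it may or may not be faster than np.roll.

```python
import numpy as np

N, nt = 8, 3
rng = np.random.default_rng(0)
A = rng.random((nt, N, N))       # layout (t, y, x)
ind = np.arange(N)

def corr_roll(sy, sx):
    """Reference: shift the data array itself."""
    shifted = np.roll(A, (-sy, -sx), axis=(1, 2))
    return np.sum(A * shifted, axis=(1, 2))

def corr_index(sy, sx):
    """Shift only the index vectors, then gather from A."""
    indy = np.roll(ind, -sy)     # counterpart of cshift(ind, sy)
    indx = np.roll(ind, -sx)     # counterpart of cshift(ind, sx)
    shifted = A[:, indy[:, None], indx[None, :]]
    return np.sum(A * shifted, axis=(1, 2))

print(np.allclose(corr_roll(3, 5), corr_index(3, 5)))
```

Both functions return one value per time step t, so they can be dropped into the sy/sx loops of the Python code unchanged.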
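Even with this reduction, the code becomes slow for large nt (e.g., 300, as in the question). In that case we can resort to the ultimate weapon, namely parallelizing the loop over sy: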
program main
implicit none
integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
! integer, parameter :: N = 100, nt = 300, dp = kind(0.0d0)
real(dp), allocatable, dimension(:,:,:) :: A, B
integer, dimension(N) :: ind, indx, indy
integer :: sx, sy, i, t
allocate( A( N, N, nt ), B( -N:N-1, -N:N-1, nt ) )
call initA
ind = [( i, i=1,N )]
!$omp parallel do private(sx,sy,t,indx,indy)
do sy = 0, N-1
   if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
   indy = cshift( ind, sy )
   do sx = 0, N-1
      indx = cshift( ind, sx )
      do t = 1, nt
         B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
      enddo
   enddo
enddo
!$omp end parallel do
print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )
! "contains subroutine initA ..." here
endprogram
The timing data look like this (using gfortran -O3 -fopenmp):
N = 100, nt = 50
sum( B( 0:N-1, 0:N-1, : ) ) = 13454442.755258575
Mac:
serial : 5.3 sec
2 threads : 2.6 sec
4 threads : 1.4 sec
N = 100, nt = 50
sum( B( 0:N-1, 0:N-1, : ) ) = 13454442.755258575
Linux:
serial : 4.8 sec
2 threads : 2.7 sec
4 threads : 1.3 sec
8 threads : 0.7 sec
16 threads : 0.4 sec
32 threads : 0.4 sec
N = 100, nt = 300 // heavy case
sum( B( 0:N-1, 0:N-1, : ) ) = 80726656.531429410
Linux:
2 threads: 16.5 sec
4 threads: 8.4 sec
8 threads: 4.4 sec
16 threads: 2.5 sec
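To make the scaling easier to read, the Linux timings for N = 100, nt = 50 can be converted into speedup and parallel efficiency. The snippet below only does arithmetic on the numbers already listed; nothing new is measured.

```python
# Wall-clock times in seconds, copied from the Linux table for N = 100, nt = 50.
timings = {1: 4.8, 2: 2.7, 4: 1.3, 8: 0.7, 16: 0.4, 32: 0.4}

serial = timings[1]
speedup = {n: serial / t for n, t in timings.items()}       # serial time / parallel time
efficiency = {n: speedup[n] / n for n in timings}           # speedup per thread

for n in timings:
    print(f"{n:2d} threads: speedup {speedup[n]:5.1f}x, efficiency {efficiency[n]:4.0%}")
```

Going from 16 to 32 threads gives no further gain, which is consistent with the machine having 16 cores.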
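So, if the Fortran code above has no mistakes (hopefully!), we can save a lot of CPU time by (1) restricting sx and sy to [0, N-1] (or, more simply, to [-N/2, N/2], with no further copying needed), (2) applying cshift to index arrays (rather than data arrays), and/or (3) parallelizing over sy (which can hopefully be combined with f2py...).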
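For completeness, the same "parallelize over sy" idea can be sketched in pure Python with a thread pool. This is not OpenMP and not the f2py route; the names `row_for_sy` and `correlate_parallel` and the worker count are made up for illustration, and whether threads give any speedup depends on how much of the time NumPy spends outside the GIL, which was not measured here. The sketch only shows that each sy is independent and that the parallel result matches the serial one.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, nt = 16, 4
rng = np.random.default_rng(0)
A = rng.random((nt, N, N))          # layout (t, y, x)

def row_for_sy(sy):
    """Compute B[:, sy, :] for one y-shift; independent of all other sy."""
    out = np.empty((nt, N))
    Ashift_y = np.roll(A, -sy, axis=1)
    for sx in range(N):
        Ashift = np.roll(Ashift_y, -sx, axis=2)
        out[:, sx] = np.sum(A * Ashift, axis=(1, 2))
    return out

def correlate_parallel(max_workers=4):
    """Distribute the sy loop over a thread pool (max_workers is arbitrary)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(row_for_sy, range(N)))
    return np.stack(rows, axis=1)   # shape (nt, N, N), indexed [t, sy, sx]

B_serial = np.stack([row_for_sy(sy) for sy in range(N)], axis=1)
B_parallel = correlate_parallel()
print(np.allclose(B_serial, B_parallel))
```

Because every worker writes to its own output array, no locking is needed, which is the same property that lets the OpenMP version declare only the loop indices and index arrays as private.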