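Here I use Fortran to provide a simple OpenMP code that computes the summation of arrays multiple times. My computer has 6 cores with 12 threads and 16 GB of memory.
There are two versions of this code. The first version has only one file, test.f90, and the summation is implemented in that file. The code is shown below: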
program main
  implicit none
  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt
  real*8,allocatable,dimension(:,:,:)::theta, e
  allocate(theta(2000,50,5))
  allocate(e(2000,50,5))
  call system_clock(count_rate=rate)
  call system_clock(count=begin)
  !$omp parallel do
  do cnt = 1, 8
    do i = 1, 1001
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do
  end do
  !$omp end parallel do
  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate
  deallocate(theta)
  deallocate(e)
end program main
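This version has no problem with OpenMP, and we can see a speedup.
The second version is modified so that the summation is implemented in a subroutine. There are two files, test.f90 and sub.f90, shown below: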
! test.f90
program main
  use sub
  implicit none
  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt
  call system_clock(count_rate=rate)
  call system_clock(count=begin)
  !$omp parallel do
  do cnt = 1, 8
    call summation()
  end do
  !$omp end parallel do
  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate
end program main
and
! sub.f90
module sub
  implicit none
contains
  subroutine summation()
    implicit none
    real*8,allocatable,dimension(:,:,:)::theta, e
    integer i, j
    allocate(theta(2000,50,5))
    allocate(e(2000,50,5))
    theta = 0.d0
    e = 0.d0
    do i = 1, 101
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do
    deallocate(theta)
    deallocate(e)
  end subroutine summation
end module sub
I also wrote a Makefile, as follows:
FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp
FFLAGS = -c
LFLAGS =
result: sub.o test.o
	$(LN) $(LFLAGS) -o result test.o sub.o
# test.f90 uses module sub, so sub.o (and sub.mod) must be built first
test.o: test.f90 sub.o
	$(FC) $(FFLAGS) -o test.o test.f90
sub.o: sub.f90
	$(FC) $(FFLAGS) -o sub.o sub.f90
clean:
	rm -f result *.o* *.mod *.e*
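(gfortran can be used instead, with -fopenmp in place of -qopenmp.) However, when we run this version, there is an obvious slowdown with OpenMP; it is even somewhat slower than the single-threaded run (without OpenMP). So what is happening here, and how can this be fixed?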
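One experiment that might help narrow this down is sketched below. It keeps the same arithmetic as the second version but moves the allocate/deallocate calls out of the subroutine, so each thread allocates its work arrays once instead of once per call. The names sub_ws, summation_ws, and main_ws are hypothetical and are not part of the files above. This is only a diagnostic under the assumption that per-call allocation inside the parallel loop may contribute to the slowdown; it has not been measured here.

```fortran
! Diagnostic sketch only: sub_ws, summation_ws and main_ws are hypothetical names.
module sub_ws
  implicit none
contains
  ! Same arithmetic as summation(), but the caller owns the work arrays.
  subroutine summation_ws(theta, e)
    real(8), intent(inout) :: theta(:,:,:)
    real(8), intent(inout) :: e(:,:,:)
    integer :: i, j
    theta = 0.d0
    e = 0.d0
    do i = 1, 101
      do j = 1, 50
        theta = theta + 0.5d0*e
      end do
    end do
  end subroutine summation_ws
end module sub_ws

program main_ws
  use sub_ws
  use omp_lib
  implicit none
  real(8), allocatable :: theta(:,:,:), e(:,:,:)
  real(8) :: t0, t1
  integer :: cnt

  t0 = omp_get_wtime()
  !$omp parallel private(theta, e)
  ! Each thread allocates its own private work arrays once,
  ! rather than once per call to the subroutine.
  allocate(theta(2000,50,5), e(2000,50,5))
  !$omp do
  do cnt = 1, 8
    call summation_ws(theta, e)
  end do
  !$omp end do
  deallocate(theta, e)
  !$omp end parallel
  t1 = omp_get_wtime()

  write(*, *) 'max threads        : ', omp_get_max_threads()
  write(*, *) 'total time cost is : ', t1 - t0
end program main_ws
```

Comparing the timing of this variant with that of the second version, for example with OMP_NUM_THREADS set to 1 and then to 8, would show whether the allocation pattern matters on this machine. If the timing barely changes, the cause probably lies elsewhere, for instance in memory bandwidth, since every thread streams over its own pair of arrays.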