Slowdown when using OpenMP and calling a subroutine in a loop

Time: 2017-11-24 18:45:55

Tags: performance fortran openmp subroutine

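Here is a simple OpenMP code in Fortran that repeatedly accumulates a sum into an array. My computer has 6 cores and 12 threads, with 16 GB of memory.

There are two versions of this code. The first version has a single file, test.f90, and the summation is implemented directly in that file. The code is shown below: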

program main
  implicit none

  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt
  real*8,allocatable,dimension(:,:,:)::theta, e

  allocate(theta(2000,50,5))
  allocate(e(2000,50,5))

  call system_clock(count_rate=rate)
  call system_clock(count=begin)

  !$omp parallel do
  do cnt = 1, 8
     do i = 1, 1001
        do j = 1, 50
           theta = theta+0.5d0*e
        end do
     end do       
  end do
  !$omp end parallel do

  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

  deallocate(theta)
  deallocate(e)

end program main

This version has no problem with OpenMP, and we can see a speedup.
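One thing worth checking before comparing the two versions: in the first version `theta` and `e` are shared by all threads (so every thread updates the same `theta` concurrently, and the arrays are read without being initialized), whereas the subroutine version gives each call its own arrays. The following is a minimal diagnostic sketch, not the author's code, that keeps everything in one file but makes the arrays private to each thread. The program name `main_private` and the `checksum` variable are hypothetical additions; the checksum is only there so the compiler cannot discard the loop. If this variant also fails to speed up, the subroutine call itself is probably not the cause.

```fortran
program main_private
  use omp_lib, only: omp_get_wtime
  implicit none

  integer :: i, j, cnt
  double precision :: t0, t1, checksum
  double precision, allocatable :: theta(:,:,:), e(:,:,:)

  checksum = 0.d0
  t0 = omp_get_wtime()

  ! theta and e are unallocated on entry, so each thread receives its own
  ! unallocated private copy and allocates it inside the parallel region.
  !$omp parallel private(theta, e) reduction(+:checksum)
  allocate(theta(2000,50,5), e(2000,50,5))
  theta = 0.d0
  e = 0.d0

  ! Same work per iteration as one call to summation() in the second version.
  !$omp do
  do cnt = 1, 8
     do i = 1, 101
        do j = 1, 50
           theta = theta + 0.5d0*e
        end do
     end do
  end do
  !$omp end do

  ! Use the result so the updates above are not dead code.
  checksum = checksum + sum(theta)

  deallocate(theta, e)
  !$omp end parallel

  t1 = omp_get_wtime()
  write(*, *) 'total time cost is : ', t1 - t0, ' checksum: ', checksum
end program main_private
```

Built with the same flags as the rest of the code (`-qopenmp` for ifort, `-fopenmp` for gfortran), it can be run with different values of the `OMP_NUM_THREADS` environment variable to see how the time changes with the thread count.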

The second version is modified so that the summation is implemented in a subroutine. There are two files, test.f90 and sub.f90, shown below:

! test.f90
program main
  use sub
  implicit none

  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt

  call system_clock(count_rate=rate)
  call system_clock(count=begin)

  !$omp parallel do
  do cnt = 1, 8
    call summation()
  end do
  !$omp end parallel do

  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

end program main

! sub.f90
module sub
  implicit none

contains

  subroutine summation()
    implicit none
    real*8,allocatable,dimension(:,:,:)::theta, e
    integer i, j

    allocate(theta(2000,50,5))
    allocate(e(2000,50,5))

    theta = 0.d0
    e = 0.d0

    do i = 1, 101
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do

    deallocate(theta)
    deallocate(e)

  end subroutine summation

end module sub

I also wrote a Makefile, as follows:

FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp

FFLAGS = -c
LFLAGS =

result: sub.o test.o
	$(LN) $(LFLAGS) -o result test.o sub.o

# test.f90 uses module sub, so sub.o (and sub.mod) must be built first
test.o: test.f90 sub.o
	$(FC) $(FFLAGS) -o test.o test.f90

sub.o: sub.f90
	$(FC) $(FFLAGS) -o sub.o sub.f90

clean:
	rm result *.o* *.mod *.e*

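(gfortran can be used instead of ifort.) However, when we run this version, using OpenMP gives a clear slowdown: it is even somewhat slower than the single-threaded build (without OpenMP). So what is happening here, and how can this be fixed?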
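The question does not establish the cause, so the following is only a sketch of one experiment. Two things differ from the single-file version: each call to `summation()` allocates and frees its own arrays inside the parallel loop, and every thread then works on its own 8 MB of data (two arrays of 2000*50*5 elements at 8 bytes each, about 4 MB per array) instead of one shared pair. The variant below removes the first difference by allocating the work arrays once per thread and passing them in. The names `sub_ws`, `summation_ws`, and `main_ws` are hypothetical. If the time still does not improve with more threads, memory traffic from the per-thread arrays is a plausible suspect, though that remains an assumption.

```fortran
module sub_ws
  implicit none
contains

  ! Hypothetical variant of summation(): the caller owns the work arrays,
  ! so no allocate/deallocate happens inside the parallel loop.
  subroutine summation_ws(theta, e)
    double precision, intent(inout) :: theta(:,:,:)
    double precision, intent(in)    :: e(:,:,:)
    integer :: i, j

    theta = 0.d0
    do i = 1, 101
      do j = 1, 50
        theta = theta + 0.5d0*e
      end do
    end do
  end subroutine summation_ws

end module sub_ws

program main_ws
  use omp_lib, only: omp_get_wtime, omp_get_max_threads
  use sub_ws
  implicit none

  integer :: cnt
  double precision :: t0, t1
  double precision, allocatable :: theta(:,:,:), e(:,:,:)

  t0 = omp_get_wtime()

  ! Each thread allocates its private work arrays once, before the loop.
  !$omp parallel private(theta, e)
  allocate(theta(2000,50,5), e(2000,50,5))
  e = 0.d0

  !$omp do
  do cnt = 1, 8
    call summation_ws(theta, e)
  end do
  !$omp end do

  deallocate(theta, e)
  !$omp end parallel

  t1 = omp_get_wtime()
  write(*, *) 'threads: ', omp_get_max_threads(), ' time: ', t1 - t0
end program main_ws
```

Running it with `OMP_NUM_THREADS` set to 1, 2, 4, and 8 and comparing the times against the original two-file version should show whether the repeated allocation matters on a given machine.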

0 answers:

No answers