Question

在我的Fortran90代码中，我有一个带有几个嵌套循环的外循环。为了加速我的代码，我尝试在外部循环上使用OpenMP，但是我有一个非常奇怪的问题：当我使用多于1个线程时，程序运行速度比使用带有1个线程的OMP慢，这比使用速度慢原始的串行程序（就挂钟时间而言。我试过1,2,3或4个线程）。然而，在所有情况下，我都得到了正确的结果。

我对我的代码进行了几次测试，最后我发现问题出现在一个子程序中，因为如果我对该例程的调用发表评论我的并行程序按预期工作，即线程数越大，越低挂钟时间。

现在，该例程接受输入4向量“ks1”，“ks2”，“ket1”，“ket2”并执行“ks1”和“ks2”之间的并集，获得“kstot”。然后它创建2个新向量，“ket1tot”和“ket2tot”，其中如果ks1tot（i）等于ks1（j），则ket1tot（i）等于ket1（j），否则ket1tot（i）= 0。对于ket2tot也一样。

然后组合存储在向量“ks1tot”，“ket1tot”，“ket2tot”中的值，我计算向量的哪些行（matFC）包含我需要的值，并通过乘以这些值，我得到最终结果（ FCtot）。

所以我在一个简单的程序中转换该例程，添加一些初始行以模仿真实程序。我的意思是我补充说：

1）一个循环（在i上）模仿我试图并行化的真实程序的外循环;

2）我实现了每个线程在不同文件上工作的事实（所以我不应该有错误的共享问题）

3）我添加了另一个循环（在k上），模仿我多次调用例程。

这是代码（构成原始子程序的部分，在文本中指出了问题）：

program evaluatefc
#ifdef _OPENMP
use omp_lib
#endif
implicit none   
integer::i,ii,j,jj,jjj,k,sizeks1,sizeks2,sizec,sizekstot,NR,NR1,maxnq
integer::line,ierr,fileunit,mythread,nfreqtot
real*8::FCtot,time1,time2
integer,allocatable,dimension(:)::ks1,ket1,ks2,ket2
integer,dimension(:),allocatable::c,kstot,ket1tot,ket2tot
real*8,allocatable,dimension(:)::matFC
character*15,allocatable,dimension(:)::matfileFC
character::fileFC*15
real*4::tstarting,tending
! This program was originally a subroutine 
! that takes in input 4 vectors, ks1, ks2, ket1, ket2
!---------------------------------------------------------------------------
! I initialize some values that in the original subroutine were computed by 
!the main program
allocate(matfileFC(3),stat=ierr)
matfileFC(1)='filea.dat'
matfileFC(2)='fileb.dat'
matfileFC(3)='filec.dat'
sizeks1=2
sizeks2=2
maxnq=11
allocate(ks1(sizeks1),stat=ierr)
allocate(ket1(sizeks1),stat=ierr)
allocate(ks2(sizeks2),stat=ierr)
allocate(ket2(sizeks2),stat=ierr)
nfreqtot=42
NR1=nfreqtot*(maxnq**2)+nfreqtot        
NR=nfreqtot*(maxnq**2)
allocate(matFC(NR),stat=ierr)
!Call two intrinsic to evaluate CPU and wall clock time
call cpu_time(time1)
tstarting=secnds(0.0)
!$OMP PARALLEL DO &
!$OMP DEFAULT(NONE) &
!$OMP firstprivate(sizeks1,sizeks2,maxnq,matfileFC,NR,NR1) &
!$OMP PRIVATE(i,ii,j,jj,k,ierr,mythread,fileunit,c,sizec,line,sizekstot) &
!$OMP PRIVATE(jjj,ket1,ks1,ket1tot,kstot,ket2,ks2,ket2tot,FCtot,matFC,fileFC)
do ii=1,3
   #ifdef _OPENMP
   mythread=OMP_GET_THREAD_NUM()
   #else
   mythread=10
   #endif
   fileFC=matfileFC(ii)
   ! Read some lines of a given file.
   fileunit=50+mythread
   open(unit=fileunit,name=fileFC,status='old',form='formatted')
   read(fileunit,*)!Do not read first line
   jjj=0
   do jj=1,NR1-1
       if(mod(jj,(maxnq**2+1)).eq.0) then
         read(fileunit,*)
       else
         jjj=jjj+1     
         read(fileunit,*)j,k,i,matFC(jjj)
   ! I actually need only the fourth valor of the line to be stored
       endif
   enddo
   close(fileunit)
   do k=1,10000000
       ! Again I initialize the abovementioned values that in the actual 
       ! subroutine are computed by the main program
       ks1(1)=mod(k,30)+1
       ks1(2)=mod(k,30)+2
       ks2(1)=mod(k,17)+1
       ks2(2)=mod(k,17)+3
       ket1(1)=mod(k,2)
       ket1(2)=mod(k,3)
       ket2(1)=mod(k,5)
       ket2(2)=mod(k,7)
       sizec=sizeks1+sizeks2
       allocate(c(sizec),stat=ierr)
       do i=1,sizeks1
           c(i)=ks1(i)
       enddo
       do i=sizeks1+1,sizec
          c(i)=ks2(i-sizeks1)
       enddo
       sizekstot=sizec
       do i=1,sizeks1
          do j=1,sizeks2
             if(ks1(i).eq.ks2(j)) then
               sizekstot=sizekstot-1
             endif
          enddo
       enddo
       allocate(kstot(sizekstot),stat=ierr)
       jjj=1
       i=1
       jj=0
       do i=1,sizec-1
           jjj=jjj+1
           do j=jjj,sizec
               if(c(i).eq.c(j)) then
                  exit   
               elseif(c(i).ne.c(j).and.j.eq.sizec) then
                  jj=jj+1
                  kstot(jj)=c(i)
               endif
           enddo
       enddo
       kstot(sizekstot)=c(sizec)
       allocate(ket1tot(sizekstot),stat=ierr)
       do i=1,sizekstot
           ket1tot(i)=0
       enddo
       allocate(ket2tot(sizekstot),stat=ierr)
       do i=1,sizekstot
           ket2tot(i)=0
       enddo
       do i=1,sizekstot
           do j=1,sizeks1
               if(kstot(i).eq.ks1(j))then
                  ket1tot(i)=ket1(j)
               endif
           enddo
       enddo
       do i=1,sizekstot
           do j=1,sizeks2
               if(kstot(i).eq.ks2(j))then
                     ket2tot(i)=ket2(j)
               endif
           enddo
       enddo
       FCtot=1
       do i=1,sizekstot
           line=(kstot(i)-1)*(maxnq)**2+ket1tot(i)*(maxnq)+ket2tot(i)+1
           FCtot=matFC(line)*FCtot
       enddo
       deallocate(c,stat=ierr)
       deallocate(kstot,stat=ierr)
       deallocate(ket1tot,stat=ierr)
       deallocate(ket2tot,stat=ierr)
   enddo
enddo
!$OMP END PARALLEL DO
call cpu_time(time2)
tending=secnds(tstarting)
write(*,*)
write(*,*)'CPU time is:'
write(*,*)time2-time1
write(*,*)
write(*,*)'Wall clock time is:'
write(*,*)tending
end program

尽管如此，我遇到了同样的问题，即使用4线程的挂钟时间比使用1线程更大。

例如我得到（以秒为单位）：

输入Wtime CPU时间

1个线程20.37 20.37

4个主题31.26 91.61

序列号19.64 19.64

我知道对OMP库的调用引入了开销，实际上1线程OMP程序比串行程序慢。但我无法理解为什么4线程OMP代码更慢。

我在Linux上使用英特尔fortran编译器2013。

有什么建议吗？

感谢您随时致力于解决此问题。

Answer 1

好的，我解决了自己的问题。

谢谢大家的建议，特别是@JorgeBellón和@High Performance Mark。

正如他们的评论所说，问题实际上是大量的分配/解除分配。如果我将分配移出循环，或者至少如果我在第一个循环之后放置它们，我得到“正常”的OpenMP行为，即线程数越大，挂钟时间越短。

对于上面的示例，使用4个线程的挂钟时间现在约为7秒。

谢谢大家的帮助。

OpenMP比串行Fortran90代码慢

1 个答案: