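I have been trying to diagnose why my Fortran code does not scale well, and I have reduced the program to a very simple test case that still scales poorly. The test case is below. I am trying to create an array, divide it evenly among the processors, and then perform some operation on it (in this case, just scaling it by a set of weights).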
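As a quick check of how evenly the array is split, here is a minimal serial sketch that repeats the index arithmetic of the test case without MPI. The grid extents are copied from the test case; the program name and the `num_procs = 4` value are only illustrative assumptions.

```fortran
program decomposition_check
  implicit none
  ! Grid extents copied from the test case below.
  integer, parameter :: xmin = 0, xmax = 1000
  integer, parameter :: zmin = 0, zmax = 50
  integer, parameter :: ymax = 1000
  ! Assumed example value; change it to try other process counts.
  integer, parameter :: num_procs = 4
  integer :: rank, top, bottom, npoints

  do rank = 0, num_procs - 1
    ! Same slab bounds along y as in the MPI program (integer division).
    top    = rank*ymax/num_procs
    bottom = (rank + 1)*ymax/num_procs - 1
    if (rank + 1 == num_procs) bottom = ymax   ! last rank takes the remainder

    ! Number of (L, I, J) points this rank would loop over.
    npoints = (bottom - top + 1)*(xmax - xmin + 1)*(zmax - zmin + 1)

    print '(a,i0,a,i0,a,i0,a,i0)', 'rank ', rank, ': y = ', top, '..', &
      bottom, ', points = ', npoints
  end do
end program decomposition_check
```

With four processes this gives 12762750 points for ranks 0 to 2 and 12813801 for rank 3, so the load imbalance between ranks is well under one percent.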
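I am not passing any information between processes, so it seems to me this should scale very well. However, as I run on more and more processors and normalize by the number of array elements each processor operates on, I see very poor scaling: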
2 processors:
For Rank: 0 Average time for 100 iterations was 6.3680603710015505E-009 with 25525500 points per loop
For Rank: 1 Average time for 100 iterations was 6.3611264474244413E-009 with 25576551 points per loop
3 processors:
For Rank: 2 Average time for 100 iterations was 8.0085945661011481E-009 with 17102085 points per loop
For Rank: 0 Average time for 100 iterations was 8.2051102639337855E-009 with 16999983 points per loop
For Rank: 1 Average time for 100 iterations was 8.2249291072820462E-009 with 16999983 points per loop
4 processors:
For Rank: 0 Average time for 100 iterations was 1.0044801473036765E-008 with 12762750 points per loop
For Rank: 3 Average time for 100 iterations was 1.0046922454937459E-008 with 12813801 points per loop
For Rank: 1 Average time for 100 iterations was 1.0178132064014425E-008 with 12762750 points per loop
For Rank: 2 Average time for 100 iterations was 1.0260574719398254E-008 with 12762750 points per loop
6 processors:
For Rank: 1 Average time for 100 iterations was 1.5841797042924197E-008 with 8525517 points per loop
For Rank: 4 Average time for 100 iterations was 1.5990067816415119E-008 with 8525517 points per loop
For Rank: 0 Average time for 100 iterations was 1.6105490894647526E-008 with 8474466 points per loop
For Rank: 3 Average time for 100 iterations was 1.6141289610460415E-008 with 8474466 points per loop
For Rank: 5 Average time for 100 iterations was 1.5936059738580745E-008 with 8576568 points per loop
For Rank: 2 Average time for 100 iterations was 1.6052278119907569E-008 with 8525517 points per loop
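One way to read these figures is to multiply each per-point value by the point count, which gives a quantity proportional to the time a rank spent in the loop. The sketch below does that for the rank 0 line of each run; the numbers are copied from the output above, and the program and variable names are hypothetical.

```fortran
program per_rank_time
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4
  ! Process counts of the four runs shown above.
  integer, parameter :: procs(n) = [2, 3, 4, 6]
  ! Per-point values printed for rank 0 in each run.
  real(dp), parameter :: t_point(n) = [ &
    6.3680603710015505e-9_dp, 8.2051102639337855e-9_dp, &
    1.0044801473036765e-8_dp, 1.6105490894647526e-8_dp ]
  ! Points per loop printed for rank 0 in each run.
  integer, parameter :: points(n) = [25525500, 16999983, 12762750, 8474466]
  integer :: k

  do k = 1, n
    ! Product is proportional to the loop time of that rank.
    print '(i2,a,es10.3)', procs(k), ' processes: value x points = ', &
      t_point(k)*real(points(k), dp)
  end do
end program per_rank_time
```

With these inputs the product stays roughly between 0.13 and 0.16 for all four process counts: each rank's loop takes about as long as before even though it handles fewer points.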
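I am running on an 8-core Mac Pro desktop with 64 GB of RAM, so it should not be limited by system resources, and since there is no actual message passing I don't see why it should run progressively slower as more cores are used. Am I missing something obvious that would cause this? I am using GCC 5.1.0 and Open MPI 1.6.5 (edit: with the -O3 flag). Any help would be appreciated. Thanks!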
Code:
PROGRAM MAIN
  use mpi
  implicit none
  real*8, allocatable :: MX(:,:,:)
  real*8, allocatable :: XFEQ(:,:,:,:)
  integer :: rank, iter, nte
  integer :: top, bottom, xmin, xmax, zmin, zmax, q, ymax, ymin
  integer :: num_procs, error

  call MPI_Init(error)                                 ! Initialize MPI.
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs, error) ! Get the number of processes.
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, error)      ! Get the individual process ID.

  ! Problem size: q weights per point on a (zmax+1) x (xmax+1) x (ymax+1) grid.
  q = 7
  xmin = 0
  ymin = 0
  zmin = 0
  ymax = 1000
  xmax = 1000
  zmax = 50
  nte = 100   ! number of timed iterations

  ! Split the y range into contiguous slabs, one per rank.
  top = rank*ymax/num_procs
  bottom = (rank+1)*ymax/num_procs - 1
  if (rank+1 == num_procs) bottom = ymax   ! last rank takes the remainder

  allocate(MX(zmin:zmax, xmin:xmax, top:bottom))
  allocate(XFEQ(0:q-1, zmin:zmax, xmin:xmax, top:bottom))

  DO iter = 1, nte
    MX = 1
    CALL COMPFEQ(top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte, XFEQ, MX)
  ENDDO

  ! Clean up and exit MPI.
  call MPI_Finalize(error)

contains

  SUBROUTINE COMPFEQ(top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte, xfeq, mx)
    implicit none
    integer :: I, J, L, top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte
    real*8 :: xfeq(0:q-1, zmin:zmax, xmin:xmax, top:bottom)
    real*8 :: mx(zmin:zmax, xmin:xmax, top:bottom)
    real*8 :: weight(0:q-1)
    ! Initializing time_col here gives it the SAVE attribute, so it
    ! accumulates across calls.
    real*8 :: time_start, time_stop, time_col = 0
    integer :: count

    count = 0
    weight(0) = 0.25
    weight(1:q-1) = 0.125

    ! CPU_TIME reports processor time, not wall-clock time.
    CALL CPU_TIME(time_start)
    DO J = top, bottom
      DO I = xmin, xmax
        DO L = zmin, zmax
          xfeq(:,L,I,J) = weight*mx(L,I,J)   ! scale the point by every weight
          count = count + 1
        ENDDO
      ENDDO
    ENDDO
    CALL CPU_TIME(time_stop)

    ! Accumulate the time per point for this call.
    time_col = time_col + (time_stop - time_start)/count

    ! Note: (iter+nte) equals 2*nte when this prints, so the printed value
    ! is half the mean per-point time over the nte iterations.
    if (iter == nte) print*, "For Rank: ", rank, "Average time for ", nte, 'iterations was', &
      time_col/(iter+nte), "with ", count, "points per loop"
  END SUBROUTINE
END PROGRAM
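The subroutine above times the loop with CPU_TIME, which reports processor time. For comparison, here is a minimal sketch of wall-clock timing with MPI_Wtime after a barrier. It is not the method used in the test case: the program name, the array `a`, and the size `n` are hypothetical stand-ins for the real workload.

```fortran
program wtime_sketch
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000   ! hypothetical per-rank workload size
  real(dp), allocatable :: a(:)
  real(dp) :: t0, t1
  integer :: rank, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  allocate(a(n))
  a = 1.0_dp

  ! Line the ranks up so they all start the timed region together.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  a = 0.125_dp*a                      ! stand-in for the scaling loop
  t1 = MPI_Wtime()

  ! Printing a(n) keeps the compiler from discarding the timed work.
  print *, 'rank', rank, 'wall time (s):', t1 - t0, 'check:', a(n)

  deallocate(a)
  call MPI_Finalize(ierr)
end program wtime_sketch
```

It can be built and run like the test case, with the MPI compiler wrapper and launcher of the installation (for Open MPI, typically `mpif90` and `mpirun`).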