Question

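I have been trying to diagnose why my Fortran code does not scale well, and I have reduced the program to a very simple test case that still scales poorly. The test case is below. I create an array, split it evenly between the processors, and then perform some operation on it (in this case, just scaling it by weights).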

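I am not passing any information between processes, so it seems to me this should scale very well. However, when I run on more and more processors and normalize the time by the number of array elements each processor operates on, I see very poor scaling: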

2 processors:

For Rank: 0 Average time for 100 iterations was 6.3680603710015505E-009 with 25525500 points per loop
For Rank: 1 Average time for 100 iterations was 6.3611264474244413E-009 with 25576551 points per loop

3 processors:

For Rank: 2 Average time for 100 iterations was 8.0085945661011481E-009 with 17102085 points per loop
For Rank: 0 Average time for 100 iterations was 8.2051102639337855E-009 with 16999983 points per loop
For Rank: 1 Average time for 100 iterations was 8.2249291072820462E-009 with 16999983 points per loop

4 processors:

For Rank: 0 Average time for 100 iterations was 1.0044801473036765E-008 with 12762750 points per loop
For Rank: 3 Average time for 100 iterations was 1.0046922454937459E-008 with 12813801 points per loop
For Rank: 1 Average time for 100 iterations was 1.0178132064014425E-008 with 12762750 points per loop
For Rank: 2 Average time for 100 iterations was 1.0260574719398254E-008 with 12762750 points per loop

6 processors:

For Rank: 1 Average time for 100 iterations was 1.5841797042924197E-008 with 8525517 points per loop
For Rank: 4 Average time for 100 iterations was 1.5990067816415119E-008 with 8525517 points per loop
For Rank: 0 Average time for 100 iterations was 1.6105490894647526E-008 with 8474466 points per loop
For Rank: 3 Average time for 100 iterations was 1.6141289610460415E-008 with 8474466 points per loop
For Rank: 5 Average time for 100 iterations was 1.5936059738580745E-008 with 8576568 points per loop
For Rank: 2 Average time for 100 iterations was 1.6052278119907569E-008 with 8525517 points per loop

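I am running on an 8-core Mac Pro desktop with 64 GB of RAM, so it should not be limited by system resources, and since there is no actual message passing I do not see why it should run progressively slower as more cores are used. Am I missing something obvious that would cause this? I am using GCC 5.1.0 and Open MPI 1.6.5 (edit: compiled with the -O3 flag). Any help would be appreciated. Thanks!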

Code:

PROGRAM MAIN
    use mpi
    implicit none
    real*8,allocatable::MX(:,:,:)
    real*8,allocatable::XFEQ(:,:,:,:)

    integer:: rank, iter, nte
    INTEGER:: top,bottom,xmin,xmax,zmin,zmax,q,ymax,ymin
    integer:: num_procs, error

    call MPI_Init ( error )                                 !  Initialize MPI.
    call MPI_Comm_size ( MPI_COMM_WORLD, num_procs, error ) !  Get the number of processes.
    call MPI_Comm_rank ( MPI_COMM_WORLD, rank, error )      !  Get the individual process ID.

    q = 7
    xmin = 0
    ymin = 0
    zmin = 0

    ymax = 1000
    xmax = 1000
    zmax = 50

    nte = 100

    top    =  rank   *ymax/num_procs
    bottom = (rank+1)*ymax/num_procs-1
    if (rank+1 == num_procs) bottom = ymax

    allocate(MX   ((ZMIN):(ZMAX),(xMIN):(xMAX),(top):(bottom)))
    allocate(xfeq  (0:Q-1,(ZMIN):(ZMAX),(xMIN):(xMAX),(top):(bottom)))

    DO ITER = 1, nte
        MX = 1
        CALL COMPFEQ(top, bottom, xmin, xmax, zmin, zmax, q, rank, iter, nte, xfeq, mx)
    ENDDO

!clean up and exit MPI
call MPI_Finalize ( error )

contains
SUBROUTINE COMPFEQ(top, bottom, xmin, xmax, zmin, zmax, q, rank, iter,  nte, xfeq, mx)
    implicit none
    INTEGER::I,J,L,top,bottom,xmin,xmax,zmin,zmax,q,rank, iter,  nte
    real*8::xfeq(0:Q-1,(ZMIN):(ZMAX),(xMIN):(xMAX), (top):(bottom))
    real*8::MX((ZMIN):(ZMAX),(xMIN):(xMAX),(top):(bottom))
    real*8::weight(0:q-1)

    real*8::time_start, time_stop, time_col = 0
    integer :: count
    count = 0

    weight(0) = 0.25
    weight(1:q-1) = 0.125
    CALL CPU_TIME ( TIME_start )
    DO J=top,bottom
    DO I=XMIN,XMAX
    DO L=zmin, zmax
        XFEQ(:,L,I,J) = weight*MX(L,I,J)
        count = count +1
    ENDDO
    ENDDO
    ENDDO
    CALL CPU_TIME ( TIME_stop )

    time_col = time_col + (time_stop - time_start)/count

    if (iter == nte) print*, "For Rank: ",rank, "Average time for ",nte,'iterations was', &
                                time_col/(iter+nte), "with ", count, "points per loop"

END SUBROUTINE

END PROGRAM
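
As a sanity check on the decomposition, here is a minimal standalone sketch (no MPI) that recomputes the per-rank bounds and the "points per loop" count using the same integer arithmetic as the test case. The program name and the hard-coded `num_procs = 4` are assumptions made for illustration only; the grid extents are copied from the code above.

```fortran
program decomposition_check
    implicit none
    ! Grid extents copied from the posted test case.
    integer, parameter :: xmin = 0, xmax = 1000
    integer, parameter :: zmin = 0, zmax = 50
    integer, parameter :: ymax = 1000
    integer :: num_procs, rank, top, bottom, npoints

    ! Assumption: emulate a 4-process run by looping over rank values.
    num_procs = 4
    do rank = 0, num_procs - 1
        ! Same bounds as in the MPI program (integer division truncates).
        top    =  rank   *ymax/num_procs
        bottom = (rank+1)*ymax/num_procs - 1
        if (rank+1 == num_procs) bottom = ymax
        ! Number of (L,I,J) points visited by the triple loop in COMPFEQ.
        npoints = (bottom - top + 1)*(xmax - xmin + 1)*(zmax - zmin + 1)
        print *, "rank", rank, "rows", top, "to", bottom, "points", npoints
    end do
end program decomposition_check
```

With `num_procs = 4` this gives 12762750 points for ranks 0 to 2 and 12813801 for rank 3, which matches the counts in the 4-processor output above, so the work is split almost evenly and load imbalance does not appear to explain the slowdown.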

Fortran MPI slowdown on an embarrassingly parallel task

0 answers: