Time: 2019-05-30 01:15:51

Tags: mpi scalapack

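A parallel Fortran code that uses the ScaLAPACK routine PDGESV to solve a system of simultaneous linear equations Ax = b fails with a segmentation fault when the number of equations N becomes large. I have not yet identified the exact value of N at which the problem appears but, for example, the code works correctly for every value I have tested up to N = 50000 and fails at N = 94423.

In particular, the failure occurs during the call to the ScaLAPACK routine PDGESV (i.e. not while allocating or deallocating memory); execution enters PDGESV but never returns from it.

I am using a Linux Mint 18.3 Sylvia system with 148 GB of memory and an Intel Xeon(R) CPU E5-1660 v4 @ 3.20GHz processor. I am using mpifortran with gfortran.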

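I am fairly sure there is nothing wrong with the Fortran code itself, since it runs perfectly for every value of N up to N = 50000 and every process configuration I have tried, and then exits with INFO = 0, indicating that no error occurred. (I have also run a slightly modified version of the program that explicitly checks the residual of the computed solution x*, i.e. computes Ax* - b, and it correctly finds a maximum absolute value close to zero.) If the matrix were singular, we would of course see PDGESV return with a non-zero INFO code instead.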
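The residual check mentioned above can be illustrated serially. This is a minimal sketch on a made-up 2 x 2 system, not the modified parallel program itself (which would need a distributed matrix-vector product in place of `matmul`). The program name `residual_check`, the matrix entries and the use of Cramer's rule as a stand-in for the PDGESV call are all assumptions for illustration.

```fortran
program residual_check
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: a(2,2), b(2), x(2), det

  ! Made-up system (not from the test program): A = [2 1; 1 3], b = (4, 7)
  a = reshape([2.0_real64, 1.0_real64, 1.0_real64, 3.0_real64], [2, 2])
  b = [4.0_real64, 7.0_real64]

  ! Solve by Cramer's rule (a stand-in for the PDGESV call)
  det  = a(1,1)*a(2,2) - a(1,2)*a(2,1)
  x(1) = (b(1)*a(2,2) - a(1,2)*b(2)) / det
  x(2) = (a(1,1)*b(2) - b(1)*a(2,1)) / det

  ! Largest absolute entry of the residual A x - b
  write(*,*) 'max |A x - b| = ', maxval(abs(matmul(a, x) - b))
end program residual_check
```

A value near machine precision indicates that the computed x satisfies the system; a large value would point to a wrong solution even when INFO = 0.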

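The machine also appears to have enough memory: for the N = 94423 problem we need only 65 GB of the 148 GB available, and there are no problems at the allocation stage (furthermore, a serial code that solves the same problem using 65 GB of memory runs without trouble).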
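As a sanity check on the 65 GB figure, here is a minimal sketch in free-form Fortran, separate from the test program. It assumes 8 bytes per DOUBLE PRECISION entry (the gfortran default) and counts only the matrix A, ignoring b, IPIV and any library workspace. The program name `memory_estimate` is made up for illustration.

```fortran
program memory_estimate
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64), parameter :: n = 94423_int64
  integer(int64) :: nbytes

  ! Assumption: 8 bytes per matrix entry. The arithmetic is done in
  ! 64-bit integers because N*N does not fit in a default 32-bit INTEGER.
  nbytes = 8_int64 * n * n

  write(*,'(a,i0)')   'bytes for A : ', nbytes
  write(*,'(a,f0.2)') 'GiB for A   : ', real(nbytes, real64) / 1024.0_real64**3
end program memory_estimate
```

The result is of the same order as the 65 GB quoted above and well within the 148 GB installed, which is consistent with the allocations succeeding.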

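My feeling is that I may be exceeding some default limit on the memory available to a single process under MPI. That is, perhaps I am simply missing a suitable flag at compile time or run time?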
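One limit that is cheap to rule in or out is integer range rather than memory. The sketch below assumes, without my having confirmed it, that somewhere in the library stack offsets into the local array are held in default 32-bit integers. The values of `mloc` and `nloc` are what I would expect NUMROC to return on process (0,0) for N = 94423 and MB = NB = 32 on a 2 x 2 grid; they were worked out by hand, not copied from a run. The program name `local_size_check` is made up for illustration.

```fortran
program local_size_check
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  ! Hypothetical local dimensions for process (0,0); not taken from a run.
  integer, parameter :: mloc = 47223, nloc = 47223
  integer(int64) :: nelem

  ! Form the product in 64-bit arithmetic so the check itself cannot overflow.
  nelem = int(mloc, int64) * int(nloc, int64)

  write(*,'(a,i0)') 'local elements of LOCAL_A : ', nelem
  write(*,'(a,i0)') 'largest default INTEGER   : ', huge(0)

  if (nelem > int(huge(0), int64)) then
     write(*,*) 'local element count does not fit in a default INTEGER'
  else
     write(*,*) 'local element count fits in a default INTEGER'
  end if
end program local_size_check
```

If the count does exceed HUGE(0), rerunning on a larger process grid, so that MLOC*NLOC per process shrinks, would be a cheap experiment. Whether that avoids the crash is untested here.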

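I have tried the 'ulimit -s unlimited' command, but this does not fix the problem.

I copy the Fortran code below. It is a simple test program that: 1) allocates space for the matrix A and the vector b, 2) fills them with random entries, 3) calls PDGESV, and then 4) deallocates the memory.

Below I also list the compile and run commands I use (with mpifortran / gfortran).

Note that I have also tried the PGI Fortran compiler and observe the same failure for the same test case (see the error output below).

Fortran code: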

      PROGRAM SOLVE_LU
      USE MPI
      IMPLICIT NONE
      INTEGER :: N
      DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: LOCAL_A
      DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: LOCAL_B 
      INTEGER :: ISTATUS
C     FOR SCALAPACK PDGESV CALL
      INTEGER  :: INFO,  NRHS, IA, JA, IB, JB
      INTEGER, ALLOCATABLE, DIMENSION (:) :: IPIV
c     FOR READING COMMAND LINE ARGUMENTS
      INTEGER :: IARGC, N_COMMAND_ARG
      CHARACTER :: ARGV*10 
C     WE USE FOLLOWING COMMAND LINE ARGUMENTS 
C     ARG 1 : N (DIMENSION OF PROBLEM)
C     ARG 2 : NPROW (NO. OF ROWS OF PROCESSES IN A RECTANGULAR ARRAY)
C     ARG 3 : NPCOL (NO. OF COLUMNS OF PROCESSES IN A RECTANGULAR ARRAY)
C     ARG 4 : BLACS BLOCK SIZE MB (BLOCKS ARE OF SIZE MB * MB) 
C   
c     FOR PARALLEL PROCESS ARRAY
      INTEGER  :: NPROW, NPCOL, ICTXT,MYROW, MYCOL, MB, NB, MLOC, NLOC
      INTEGER :: IDESCA(9), IDESCB(9)
      INTEGER :: IERR
      INTEGER :: NUMROC


c     for random number seed
      INTEGER :: ISEEDSIZE
      INTEGER, ALLOCATABLE, DIMENSION ( :) :: SEED

C      ----------------------------------------
C      -------  EXECUTABLE STATEMENTS   -------


C      ===============================================
C      READ IN COMMAND LINE ARGUMENTS IF PRESENT

      N_COMMAND_ARG = iargc()
      IF (N_COMMAND_ARG == 2) THEN
          WRITE(*,*) 'ILLEGAL NO. OF COMMAND LINE PARAMETERS'
          STOP
      ENDIF
      IF (N_COMMAND_ARG .GE. 1)THEN
          CALL GETARG(1,argv)
C          WRITE(*,*)'ARGV = ',ARGV
          READ (ARGV,'(I10)') N
      ELSE
          N = 100
      ENDIF   

      IF (N_COMMAND_ARG .GE. 3)THEN
          CALL GETARG(2,argv)
          READ (ARGV,'(I10)') NPROW
          CALL GETARG(3,argv)
          READ (ARGV,'(I10)') NPCOL

      ELSE
          NPROW = 2
          NPCOL = 2
      ENDIF 

      IF (N_COMMAND_ARG .GE. 4)THEN
          CALL GETARG(4,argv)
          READ (ARGV,'(I10)') MB
      ELSE
          MB = 8
      ENDIF
      NB = MB

C     ==============================================
C     INITIALISE THE BLACS PROCESS GRID, FIND DIMENSIONS OF LOCAL
C     MATRICES / VECTORS AND ALLOCATE SPACE

      CALL SL_INIT(ICTXT, NPROW, NPCOL)
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )

      MLOC = NUMROC(N, MB, MYROW, 0, NPROW)
      NLOC = NUMROC(N, NB, MYCOL, 0, NPCOL)

      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 )WRITE(*,*)
     @       'WE ARE SOLVING A SYSTEM OF ', N, ' LINEAR EQUATIONS'

      WRITE(*,*) 'PROC: ',MYROW, MYCOL,'HAS  MLOC, NLOC =', MLOC,NLOC

c      ==============================================
C     ALLOCATE SPACE FOR MATRIX A AND VECTOR B (X OVERWRITES B)

      WRITE(*,*) 'PROC: ',MYROW, MYCOL,' ALLOCATING SPACE ...'

      ALLOCATE ( LOCAL_A(MLOC,NLOC), STAT = ISTATUS )
      IF(ISTATUS .NE. 0) THEN
          WRITE(*,*)'UNABLE TO ALLOCATE LOCAL_A, PROCESS: ',MYROW,MYCOL
          STOP
      ENDIF

      ALLOCATE ( LOCAL_B(MLOC), STAT = ISTATUS )
      IF (ISTATUS /= 0) THEN
          WRITE(*,*)
     @ ' FAILED TO ALLOCATE SPACE FOR LOCAL_B, PROCESS: ',MYROW,MYCOL
          STOP
      ENDIF

c     BLACS DESCRIPTOR FOR A AND ITS COPY
      CALL DESCINIT (IDESCA, N, N, MB, NB, 0, 0,
     @               ICTXT, MLOC, IERR)


c     BLACS DESCRIPTOR FOR B AND SOLN VECTOR X
      CALL DESCINIT (IDESCB, N, 1, MB, 1, 0, 0, ICTXT, MLOC, IERR)  

c      ==============================================
C      FILL ENTRIES OF MATRIX A AND R.H.S. VECTOR B WITH RANDOM ENTRIES

      WRITE(*,*)'PROC: ',MYROW, MYCOL,
     @        ' CONSTRUCTING MATRIX A AND RHS VECTOR B ...'

      CALL RANDOM_SEED

      CALL RANDOM_SEED ( SIZE = ISEEDSIZE ) ! GET SIZE OF SEED ARRAY

      ALLOCATE ( SEED(1:ISEEDSIZE) )
      CALL RANDOM_SEED ( GET = SEED )

      SEED(1) = SEED(1) + NPCOL*MYROW + MYCOL ! ENSURES DIFFERENT SEED
                                              ! FOR EACH PROCESS
      CALL RANDOM_SEED ( PUT = SEED )

      CALL RANDOM_NUMBER(LOCAL_B)

      CALL RANDOM_NUMBER(LOCAL_A)

c      ==============================================
C      CALL SCALAPACK LU SOLVER ROUTINE

      WRITE(*,*)'PROC: ',MYROW, MYCOL,
     @    'NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..'
      ALLOCATE ( IPIV(MLOC + MB), STAT=ISTATUS )
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO ALLOCATE IPIV, PROCESS: ',MYROW,MYCOL
          STOP
      ENDIF


      IA = 1
      JA = 1
      IB = 1
      JB = 1
      NRHS = 1
      INFO = 0

      CALL PDGESV(N, NRHS, LOCAL_A, IA, JA, IDESCA, IPIV, 
     @            LOCAL_B, IB, JB, IDESCB, INFO )

      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
          WRITE(*,*)
          WRITE(*,*) 'INFO code returned by PDGESV = ', INFO
          WRITE(*,*)
      END IF


c      ==============================================
C     DEALLOCATE MEMORY
      DEALLOCATE(LOCAL_A, STAT=ISTATUS)
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO DEALLOCATE ' 
          STOP
      ENDIF   


      DEALLOCATE(LOCAL_B, STAT=ISTATUS)
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO DEALLOCATE ' 
          STOP
      ENDIF   


      DEALLOCATE(IPIV, STAT=ISTATUS)
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO DEALLOCATE ' 
          STOP
      ENDIF   

c     ===================================================
c     RELEASE BLACS CONTEXT

      CALL BLACS_GRIDEXIT(ictxt)
      CALL BLACS_EXIT(0)


      END PROGRAM SOLVE_LU

I compile the above code with:

mpifort -Wall -mcmodel=medium -static-libgfortran -m64 /opt/openblas/lib/libopenblas.a /usr/local/lib/libscalapack.a /opt/openblas/lib/libopenblas.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -o para.exe resolve_by_lu_parallelmpi_simple_light.for /opt/openblas/lib/libopenblas.a /usr/local/lib/libscalapack.a /opt/openblas/lib/libopenblas.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran

This produces no errors or warnings. I then run it with (for example):

mpirun -n 4 ./para.exe 944 2 2 32 > DUMP05

Here we solve a system of 944 equations on a 2x2 BLACS process grid with block size 32.

For this small-N case we get the following output from a successful run:

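WE ARE SOLVING A SYSTEM OF 944 LINEAR EQUATIONS

PROC: 0 0 HAS  MLOC, NLOC = 480 480

PROC: 0 0 ALLOCATING SPACE ...

PROC: 1 0 HAS  MLOC, NLOC = 464 480

PROC: 1 0 ALLOCATING SPACE ...

PROC: 0 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 1 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 1 1 HAS  MLOC, NLOC = 464 464

PROC: 1 1 ALLOCATING SPACE ...

PROC: 1 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 0 1 HAS  MLOC, NLOC = 480 464

PROC: 0 1 ALLOCATING SPACE ...

PROC: 0 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 0 0 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

PROC: 1 0 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

PROC: 1 1 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

PROC: 0 1 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

INFO code returned by PDGESV = 0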
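The MLOC and NLOC values in this listing follow from the block-cyclic distribution. The sketch below is a stand-alone, free-form Fortran illustration of that rule for N = 944, block size 32 and 2 processes along one grid dimension. The function `local_count` is a hypothetical stand-in for NUMROC with the source process fixed at 0, written from my understanding of the distribution; the test program itself keeps calling NUMROC.

```fortran
program block_cyclic_sizes
  implicit none
  integer, parameter :: n = 944, nb = 32, nprocs = 2
  integer :: iproc

  ! Print the number of rows (or columns) owned by each process index.
  do iproc = 0, nprocs - 1
     write(*,'(a,i0,a,i0)') 'process index ', iproc, ' holds ', &
          local_count(n, nb, iproc, nprocs)
  end do

contains

  ! Hypothetical stand-in for NUMROC, assuming the first block lives
  ! on process 0. Not part of ScaLAPACK.
  pure integer function local_count(n, nb, iproc, nprocs) result(cnt)
    integer, intent(in) :: n, nb, iproc, nprocs
    integer :: nblocks, extra

    nblocks = n / nb                    ! number of complete blocks
    cnt     = (nblocks / nprocs) * nb   ! blocks that every process receives
    extra   = mod(nblocks, nprocs)      ! leftover complete blocks
    if (iproc < extra) then
       cnt = cnt + nb                   ! one more complete block
    else if (iproc == extra) then
       cnt = cnt + mod(n, nb)           ! the trailing partial block
    end if
  end function local_count
end program block_cyclic_sizes
```

For these inputs the two process indices hold 480 and 464 entries respectively, matching the MLOC and NLOC pairs printed by the four processes above.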

So far, so good. However, running with:

mpirun -n 4 ./para.exe 94423 2 2 32 > DUMP06

produces the following error (note that such a run needs 65 GB of memory and takes about 45 minutes on my machine):

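Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Backtrace for this error:

Backtrace for this error: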

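For some reason no backtrace information is printed, but running the same code built with the PGI Fortran compiler (on a different machine running Red Hat Linux 7.3) fails with the following output: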

[sca1993:113193] *** Process received signal ***

[sca1993:113193] Signal: Segmentation fault (11)

[sca1993:113193] Signal code: Address not mapped (1)

[sca1993:113193] Failing at address: 0x2b8c5a036390

[sca1993:113193] [0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5d0)[0x2b900528c5d0]

[sca1993:113193] [1] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(+0x280c950)[0x2b9003acc950]

[sca1993:113193] [2] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(daxpy_k_HASWELL+0x7f)[0x2b9003acc54f]

[sca1993:113193] [3] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(dger_k_HASWELL+0xd5)[0x2b9003ad6635]

[sca1993:113193] [4] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(dger_+0x21f)[0x2b90013d9f5f]

[sca1993:113193] [5] ./para_try.exe[0x446e70]

[sca1993:113193] [6] ./para_try.exe[0x41b4ad]

[sca1993:113193] [7] ./para_try.exe[0x4071e1]

[sca1993:113193] [8] ./para_try.exe[0x406b39]

[sca1993:113193] [9] ./para_try.exe[0x404ba6]

[sca1993:113193] [10] ./para_try.exe[0x403654]

[sca1993:113193] [11] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(__libc_start_main+0xf5)[0x2b9005cb83d5]

[sca1993:113193] [12] ./para_try.exe[0x403549]

[sca1993:113193] *** End of error message ***

If anyone has any suggestions, I would be very grateful. Many thanks, Dan.

0 answers:

No answers