How to share memory across cluster machines (qsub, OpenMPI)

Date: 2016-06-12 14:08:36

Tags: cluster-computing openmpi qsub

Dear all,

I have a question about sharing memory in a cluster. I am new to clusters, and after several weeks of trying I have failed to solve my problem, so I am asking for help here. Any advice will be greatly appreciated!

I want to use SOAPdenovo, a piece of software for assembling human genome sequencing data, but it failed at one step because of insufficient memory (my machine has 512 GB). So I turned to a cluster (it has three big nodes, each also with 512 GB of memory) and started learning to submit jobs with qsub. Since a single node cannot solve my problem, I googled around and found that OpenMPI might help, but when I ran OpenMPI with the demo data, it seemed to just run the command several times. Then I learned that for OpenMPI to help, the software itself must be built against the OpenMPI library, and I do not know whether SOAPdenovo supports OpenMPI; I have asked the authors, but they have not replied yet. Assuming SOAPdenovo supports OpenMPI, how should I solve my problem? And if it does not, can I still use the memory of different nodes to run the software?

This problem has been tormenting me, and I am grateful for any help. Below is what I did, along with some information about the cluster:

  1. Install OpenMPI and submit the job

    1) Job script:

    #!/bin/bash
    #
    # Grid Engine directives: run in the current working directory,
    # merge stderr into stdout, and use bash as the job shell.
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #

    # Make the OpenMPI binaries and libraries visible to the job.
    export PATH=/tools/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
    soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
    workPath="/NGS"
    outputPath="assembly/soap/demo"
    # Launch SOAPdenovo under mpirun; stdout goes to ass.log, stderr to ass.err.
    /tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file \
        -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix \
        > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
    

    2) Submitting the job:

    qsub -pe orte 60 mpi.qsub
    

    3) The log in ass.err

    a) According to the log, SOAPdenovo seems to have run several times, and the count matches the 60 slots I requested:

    cat ass.err | grep "Pregraph" | wc -l
    60
    

    b) Details:

    less ass.err (it seems SOAPdenovo ran several times; when I run it on my own machine, it outputs only one Pregraph):
    
    
    Version 2.04: released on July 13th, 2012
    Compile Apr 27 2016     15:50:02
    
    ********************
    Pregraph
    ********************
    
    Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix 
    
    In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
    
    
    Version 2.04: released on July 13th, 2012
    Compile Apr 27 2016     15:50:02
    
    ********************
    Pregraph
    ********************
    
    and so on
    

    c) Messages on stdout (ass.log)

    cat ass.log:
    
    --------------------------------------------------------------------------
    WARNING: A process refused to die despite all the efforts!
    This process may still be running and/or consuming resources.
    
    Host: smp03
    PID:  75035
    
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------
    
  2. Information about the cluster:

      1) qconf -sql

      all.q
      smp.q
      

      2) qconf -spl

      mpi
      mpich
      orte
      zhongxm
      

      3) qconf -sp zhongxm

      pe_name            zhongxm
      slots              999
      user_lists         NONE
      xuser_lists        NONE
      start_proc_args    /bin/true
      stop_proc_args     /bin/true
      allocation_rule    $fill_up
      control_slaves     TRUE
      job_is_first_task  FALSE
      urgency_slots      min
      accounting_summary FALSE
      

      4) qconf -sq smp.q

      qname                 smp.q
      hostlist              @smp.q
      seq_no                0
      load_thresholds       np_load_avg=1.75
      suspend_thresholds    NONE
      nsuspend              1
      suspend_interval      00:05:00
      priority              0
      min_cpu_interval      00:05:00
      processors            UNDEFINED
      qtype                 BATCH INTERACTIVE
      ckpt_list             NONE
      pe_list               make zhongxm
      rerun                 FALSE
      slots                 1
      tmpdir                /tmp
      shell                 /bin/csh
      prolog                NONE
      epilog                NONE
      shell_start_mode      posix_compliant
      starter_method        NONE
      suspend_method        NONE
      resume_method         NONE
      terminate_method      NONE
      notify                00:00:60
      owner_list            NONE
      user_lists            NONE
      xuser_lists           NONE
      subordinate_list      NONE
      complex_values        NONE
      projects              NONE
      xprojects             NONE
      calendar              NONE
      initial_state         default
      s_rt                  INFINITY
      h_rt                  INFINITY
      s_cpu                 INFINITY
      h_cpu                 INFINITY
      s_fsize               INFINITY
      h_fsize               INFINITY
      s_data                INFINITY
      h_data                INFINITY
      s_stack               INFINITY
      h_stack               INFINITY
      s_core                INFINITY
      h_core                INFINITY
      s_rss                 INFINITY
      h_rss                 INFINITY
      s_vmem                INFINITY
      h_vmem                INFINITY
      

      5) qconf -sq all.q

      qname                 all.q
      hostlist              @allhosts
      seq_no                0
      load_thresholds       np_load_avg=1.75
      suspend_thresholds    NONE
      nsuspend              1
      suspend_interval      00:05:00
      priority              0
      min_cpu_interval      00:05:00
      processors            UNDEFINED
      qtype                 BATCH INTERACTIVE
      ckpt_list             NONE
      pe_list               make zhongxm
      rerun                 FALSE
      slots                 16,[c0219.local=32]
      tmpdir                /tmp
      shell                 /bin/csh
      prolog                NONE
      epilog                NONE
      shell_start_mode      posix_compliant
      starter_method        NONE
      suspend_method        NONE
      resume_method         NONE
      terminate_method      NONE
      notify                00:00:60
      owner_list            NONE
      user_lists            mobile
      xuser_lists           NONE
      subordinate_list      NONE
      complex_values        NONE
      projects              NONE
      xprojects             NONE
      calendar              NONE
      initial_state         default
      s_rt                  INFINITY
      h_rt                  INFINITY
      s_cpu                 INFINITY
      h_cpu                 INFINITY
      s_fsize               INFINITY
      h_fsize               INFINITY
      s_data                INFINITY
      h_data                INFINITY
      s_stack               INFINITY
      h_stack               INFINITY
      s_core                INFINITY
      h_core                INFINITY
      s_rss                 INFINITY
      h_rss                 INFINITY
      s_vmem                INFINITY
      h_vmem                INFINITY
      

2 Answers

Answer 0 (score: 0)

According to https://hpc.unt.edu/soapdenovo, the software does not support MPI:

    "This code is not compiled with MPI and can be used in parallel only on a SINGLE node, via the thread model."

So you cannot simply start the software with mpiexec on a cluster to get access to more memory. Cluster machines are connected by non-coherent networks (Ethernet, InfiniBand), which are slower than a memory bus, and the PCs in a cluster do not share memory. Clusters use MPI libraries (OpenMPI or MPICH) to work with the network, and every exchange between nodes is explicit: a program calls MPI_Send in one process and MPI_Recv in another. There are also one-sided calls such as MPI_Put/MPI_Get for accessing remote memory (RDMA, Remote Direct Memory Access), but that is not the same as local memory.
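
You can see what happened from your log: mpirun simply starts N independent copies of any non-MPI program it is given, which is exactly why ass.err above contains 60 "Pregraph" banners. A minimal sketch of the same effect (the process count here is arbitrary):

    $ mpirun -np 4 echo Pregraph
    Pregraph
    Pregraph
    Pregraph
    Pregraph

Each of the 60 slots therefore ran a full, independent SOAPdenovo trying to build the entire graph on its own, which increases the memory pressure per node instead of dividing it.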

Answer 1 (score: 0)

osgx, thank you very much for your reply, and sorry for the delayed response.

Since I do not have a computer-science background, I find some of the vocabulary, such as ELF, hard to follow. So I have a few new questions, listed below; thanks in advance for any help:

1) When I run "ldd SOAPdenovo-63mer", it outputs "not a dynamic executable". Is that what you meant by "the code is not compiled with MPI"?
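
For context, that message is what ldd prints for a statically linked binary, so ldd cannot list any libraries at all, and by itself it proves nothing about MPI. A dynamically linked MPI program would normally show an MPI library among its dependencies. A hypothetical comparison, where some_mpi_program is a placeholder for any MPI-built binary:

    # Statically linked: ldd has nothing to inspect.
    $ ldd SOAPdenovo-63mer
            not a dynamic executable
    # A dynamically linked MPI binary would typically list libmpi
    # (the exact library name depends on the MPI implementation):
    $ ldd some_mpi_program | grep -i mpi

Whether SOAPdenovo was built with MPI can only be confirmed by its authors or documentation; according to the page cited in the answer above, it was not.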

2) In short, does this mean I cannot solve the problem on the cluster and have to look for a single machine with more than 512 GB of memory?

3) In addition, I used another piece of software called ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/), and it also failed due to insufficient memory. According to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what does "it uses Shared Memory Parallelization" mean? Does it mean it can use memory across the cluster, or only the memory within one node, so that I still have to find a single machine with enough memory?

    C1. Can I run ALLPATHS-LG on a cluster?
    You can, but it will only use one machine, not the entire cluster.  That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
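
If I understand it correctly, "Shared Memory Parallelization" means a single multi-threaded process whose threads all live in one node's address space, so only that node's RAM is usable. SOAPdenovo works the same way through its -p option (number of CPUs/threads), with no mpirun involved; a hypothetical single-node run reusing the variables from my job script above:

    # One process, many threads, ONE node: all 60 threads share this node's
    # 512 GB of RAM, but memory on the other nodes stays out of reach.
    $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 \
        -o $workPath/$outputPath/graph_prefix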

By the way, this is my first time posting here. I suppose I should have replied with a comment, but given how many words this took, I used "answer your question" instead.