How to share memory across cluster machines (qsub, OpenMPI)

Date: 2016-06-12 14:08:36

Tags: cluster-computing openmpi qsub

Dear all,

I have a question about sharing memory in a cluster. I am new to clusters, and after several weeks of trying I have failed to solve my problem, so I am asking for help here. Any advice will be greatly appreciated!

I want to use SOAPdenovo, a piece of software for assembling human genome sequencing data, but it failed at one step because of insufficient memory (my machine has 512 GB). So I turned to a cluster (it has three big nodes, each also with 512 GB of memory) and started learning to submit jobs with qsub. Since a single node cannot solve my problem, I googled around and found that OpenMPI might help, but when I ran OpenMPI with the demo data, it seemed to just run the command several times. Then I learned that for OpenMPI to help, the software itself must be built against the OpenMPI library, and I do not know whether SOAPdenovo supports OpenMPI; I have asked the authors, but they have not replied yet. Assuming SOAPdenovo supports OpenMPI, how should I solve my problem? And if it does not, can I still use the memory of different nodes to run the software?

This problem has been tormenting me, and I am grateful for any help. Below is what I did, along with some information about the cluster:

  1. Install OpenMPI and submit the job

    1) Job script:

    #!/bin/bash
    #
    # Grid Engine directives: run in the current working directory,
    # merge stderr into stdout, and use bash as the job shell.
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #

    # Make the OpenMPI binaries and libraries visible to the job.
    export PATH=/tools/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
    soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
    workPath="/NGS"
    outputPath="assembly/soap/demo"
    # Launch SOAPdenovo under mpirun; stdout goes to ass.log, stderr to ass.err.
    /tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file \
        -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix \
        > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
    

    2) Submitting the job:

    qsub -pe orte 60 mpi.qsub
    

    3) The log in ass.err

    a) According to the log, SOAPdenovo seems to have run several times, and the count matches the 60 slots I requested:

    cat ass.err | grep "Pregraph" | wc -l
    60
    

    b) Details:

    less ass.err (it seems SOAPdenovo ran several times; when I run it on my own machine, it outputs only one Pregraph):
    
    
    Version 2.04: released on July 13th, 2012
    Compile Apr 27 2016     15:50:02
    
    ********************
    Pregraph
    ********************
    
    Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix 
    
    In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
    
    
    Version 2.04: released on July 13th, 2012
    Compile Apr 27 2016     15:50:02
    
    ********************
    Pregraph
    ********************
    
    and so on
    

    c) Messages on stdout (ass.log)

    cat ass.log:
    
    --------------------------------------------------------------------------
    WARNING: A process refused to die despite all the efforts!
    This process may still be running and/or consuming resources.
    
    Host: smp03
    PID:  75035
    
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------
    
  2. Information about the cluster:

      1) qconf -sql

      all.q
      smp.q
      

      2) qconf -spl

      mpi
      mpich
      orte
      zhongxm
      

      3) qconf -sp zhongxm

      pe_name            zhongxm
      slots              999
      user_lists         NONE
      xuser_lists        NONE
      start_proc_args    /bin/true
      stop_proc_args     /bin/true
      allocation_rule    $fill_up
      control_slaves     TRUE
      job_is_first_task  FALSE
      urgency_slots      min
      accounting_summary FALSE
      

      4) qconf -sq smp.q

      qname                 smp.q
      hostlist              @smp.q
      seq_no                0
      load_thresholds       np_load_avg=1.75
      suspend_thresholds    NONE
      nsuspend              1
      suspend_interval      00:05:00
      priority              0
      min_cpu_interval      00:05:00
      processors            UNDEFINED
      qtype                 BATCH INTERACTIVE
      ckpt_list             NONE
      pe_list               make zhongxm
      rerun                 FALSE
      slots                 1
      tmpdir                /tmp
      shell                 /bin/csh
      prolog                NONE
      epilog                NONE
      shell_start_mode      posix_compliant
      starter_method        NONE
      suspend_method        NONE
      resume_method         NONE
      terminate_method      NONE
      notify                00:00:60
      owner_list            NONE
      user_lists            NONE
      xuser_lists           NONE
      subordinate_list      NONE
      complex_values        NONE
      projects              NONE
      xprojects             NONE
      calendar              NONE
      initial_state         default
      s_rt                  INFINITY
      h_rt                  INFINITY
      s_cpu                 INFINITY
      h_cpu                 INFINITY
      s_fsize               INFINITY
      h_fsize               INFINITY
      s_data                INFINITY
      h_data                INFINITY
      s_stack               INFINITY
      h_stack               INFINITY
      s_core                INFINITY
      h_core                INFINITY
      s_rss                 INFINITY
      h_rss                 INFINITY
      s_vmem                INFINITY
      h_vmem                INFINITY
      

      5) qconf -sq all.q

      qname                 all.q
      hostlist              @allhosts
      seq_no                0
      load_thresholds       np_load_avg=1.75
      suspend_thresholds    NONE
      nsuspend              1
      suspend_interval      00:05:00
      priority              0
      min_cpu_interval      00:05:00
      processors            UNDEFINED
      qtype                 BATCH INTERACTIVE
      ckpt_list             NONE
      pe_list               make zhongxm
      rerun                 FALSE
      slots                 16,[c0219.local=32]
      tmpdir                /tmp
      shell                 /bin/csh
      prolog                NONE
      epilog                NONE
      shell_start_mode      posix_compliant
      starter_method        NONE
      suspend_method        NONE
      resume_method         NONE
      terminate_method      NONE
      notify                00:00:60
      owner_list            NONE
      user_lists            mobile
      xuser_lists           NONE
      subordinate_list      NONE
      complex_values        NONE
      projects              NONE
      xprojects             NONE
      calendar              NONE
      initial_state         default
      s_rt                  INFINITY
      h_rt                  INFINITY
      s_cpu                 INFINITY
      h_cpu                 INFINITY
      s_fsize               INFINITY
      h_fsize               INFINITY
      s_data                INFINITY
      h_data                INFINITY
      s_stack               INFINITY
      h_stack               INFINITY
      s_core                INFINITY
      h_core                INFINITY
      s_rss                 INFINITY
      h_rss                 INFINITY
      s_vmem                INFINITY
      h_vmem                INFINITY
      

2 Answers

Answer 0 (score: 0)

According to https://hpc.unt.edu/soapdenovo, the software does not support MPI:

    "This code is not compiled with MPI and can be used in parallel only on a SINGLE node, via the thread model."

So you cannot simply start the software with mpiexec on a cluster to get access to more memory. Cluster machines are connected by non-coherent networks (Ethernet, InfiniBand), which are slower than a memory bus, and the PCs in a cluster do not share memory. Clusters use MPI libraries (OpenMPI or MPICH) to work with the network, and every exchange between nodes is explicit: a program calls MPI_Send in one process and MPI_Recv in another. There are also one-sided calls such as MPI_Put/MPI_Get for accessing remote memory (RDMA, Remote Direct Memory Access), but that is not the same as local memory.
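
You can see what happened from your log: mpirun simply starts N independent copies of any non-MPI program it is given, which is exactly why ass.err above contains 60 "Pregraph" banners. A minimal sketch of the same effect (the process count here is arbitrary):

    $ mpirun -np 4 echo Pregraph
    Pregraph
    Pregraph
    Pregraph
    Pregraph

Each of the 60 slots therefore ran a full, independent SOAPdenovo trying to build the entire graph on its own, which increases the memory pressure per node instead of dividing it.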

Answer 1 (score: 0)

osgx, thank you very much for your reply, and sorry for the delayed response.

Since I do not have a computer-science background, I find some of the vocabulary, such as ELF, hard to follow. So I have a few new questions, listed below; thanks in advance for any help:

1) When I run "ldd SOAPdenovo-63mer", it outputs "not a dynamic executable". Is that what you meant by "the code is not compiled with MPI"?
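
For context, that message is what ldd prints for a statically linked binary, so ldd cannot list any libraries at all, and by itself it proves nothing about MPI. A dynamically linked MPI program would normally show an MPI library among its dependencies. A hypothetical comparison, where some_mpi_program is a placeholder for any MPI-built binary:

    # Statically linked: ldd has nothing to inspect.
    $ ldd SOAPdenovo-63mer
            not a dynamic executable
    # A dynamically linked MPI binary would typically list libmpi
    # (the exact library name depends on the MPI implementation):
    $ ldd some_mpi_program | grep -i mpi

Whether SOAPdenovo was built with MPI can only be confirmed by its authors or documentation; according to the page cited in the answer above, it was not.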

2) In short, does this mean I cannot solve the problem on the cluster and have to look for a single machine with more than 512 GB of memory?

3) In addition, I used another piece of software called ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/), and it also failed due to insufficient memory. According to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what does "it uses Shared Memory Parallelization" mean? Does it mean it can use memory across the cluster, or only the memory within one node, so that I still have to find a single machine with enough memory?

    C1. Can I run ALLPATHS-LG on a cluster?
    You can, but it will only use one machine, not the entire cluster.  That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
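
If I understand it correctly, "Shared Memory Parallelization" means a single multi-threaded process whose threads all live in one node's address space, so only that node's RAM is usable. SOAPdenovo works the same way through its -p option (number of CPUs/threads), with no mpirun involved; a hypothetical single-node run reusing the variables from my job script above:

    # One process, many threads, ONE node: all 60 threads share this node's
    # 512 GB of RAM, but memory on the other nodes stays out of reach.
    $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 \
        -o $workPath/$outputPath/graph_prefix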

By the way, this is my first time posting here. I suppose I should have replied with a comment, but given how many words this took, I used "answer your question" instead.