I have a question about sharing memory across a cluster. I am new to clusters and, after several weeks of trying, have failed to solve my problem, so I am asking for help here. Any advice would be greatly appreciated!

I want to use SOAPdenovo, a program for assembling genomes such as the human genome, to assemble my data. However, it failed at one step because it ran out of memory (my machine has 512 GB). So I turned to a cluster (it has three large nodes, each also with 512 GB of memory) and started learning to submit jobs with qsub. Since a single node cannot solve my problem, I googled and found that Open MPI might help, but when I ran it with demo data it seemed to just run the command several times. I then learned that for Open MPI to help, the software has to be built against the Open MPI library, and I do not know whether SOAPdenovo supports Open MPI; I have asked the author but have not yet received an answer. Assuming SOAPdenovo supports Open MPI, how should I solve my problem? And if it does not, can I still use the memory of different nodes to run the software?

This problem has been tormenting me, and any help is appreciated. Below are my job and some information about the cluster:

1) Job script:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
export PATH=/tools/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"
/tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
2) Submitting the job:
qsub -pe orte 60 mpi.qsub
3) The ass.err log:
a) According to the log, SOAPdenovo seems to have been run several times:
grep -c "Pregraph" ass.err
60
b) Details from less ass.err (it seems SOAPdenovo ran several times, because when I run it on my own machine it outputs only one "Pregraph"):
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix
In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
and so on
c) Information from stdout:
cat ass.log:
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.
Host: smp03
PID: 75035
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
And here is some information about the cluster:
1) qconf -sql
all.q
smp.q
2) qconf -spl
mpi
mpich
orte
zhongxm
3) qconf -sp zhongxm
pe_name zhongxm
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
4) qconf -sq smp.q
qname smp.q
hostlist @smp.q
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
5) qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 16,[c0219.local=32]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists mobile
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
Answer 0 (score 0):
According to https://hpc.unt.edu/soapdenovo, the software does not support MPI:

"This code is not compiled using MPI and can only be used in parallel on a SINGLE node via the threading model."
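This also explains the 60 "Pregraph" banners in ass.err: a binary that makes no MPI calls cannot tell it was started under mpirun, so `mpirun` with 60 slots simply starts 60 independent copies of the whole assembly. The effect can be reproduced with a plain loop (a sketch; `echo` stands in for the SOAPdenovo binary):

```shell
# A program with no MPI calls cannot tell it was started by mpirun:
# "mpirun -np 4 prog" just runs 4 independent copies of prog, none of
# them sharing memory or coordinating work. A plain loop reproduces
# the effect (echo stands in for the SOAPdenovo binary):
for rank in 0 1 2 3; do
  echo "copy $rank: Pregraph"   # each copy prints its own banner
done
# Four "Pregraph" banners -- the same pattern as the 60 in ass.err.
```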
So you cannot simply launch the software with mpiexec on the cluster to get access to more memory. Cluster nodes are connected by non-coherent interconnects (Ethernet, InfiniBand) that are slower than a memory bus, and the machines in a cluster do not share memory. Clusters use an MPI library (Open MPI or MPICH) to drive the network, and every request between nodes is explicit: the program calls MPI_Send in one process and MPI_Recv in another. There are also one-sided calls such as MPI_Put/MPI_Get for accessing remote memory (RDMA, Remote Direct Memory Access), but that is still not the same as local memory.
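Given that, the practical route on this cluster is a single-node, thread-only job: drop mpirun and let SOAPdenovo's own `-p` flag start the threads, which all share that one node's 512 GB. A sketch reusing the paths from the question; the "smp" PE name is an assumption, standing for any PE whose allocation_rule is $pe_slots (so every slot lands on one host):

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
# "smp" is a hypothetical PE name: pick one whose allocation_rule is
# $pe_slots so that all 16 slots are placed on the same node.
#$ -pe smp 16

soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"

# No mpirun: SOAPdenovo's -p flag starts the worker threads itself,
# and all threads share this single node's memory.
$soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F \
    -p 16 -V -o $workPath/$outputPath/graph_prefix \
    > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
```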
Answer 1 (score 0):
osgx, thank you very much for your reply, and sorry for the delayed response.

Since I do not have a computer science background, I don't think I fully understand some of the vocabulary, such as ELF. So I have a few new questions; I have listed them below, and thanks in advance for the help:
1) When I run ldd on SOAPdenovo-63mer, it outputs "not a dynamic executable". Is that what you meant by "the code is not compiled using MPI"?
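For reference, this is what the two inspection commands report (a sketch using /bin/ls as a stand-in binary; exact wording varies by distribution). Note that static linking and MPI support are separate matters: "not a dynamic executable" only tells you how the binary was linked, not whether it was built against an MPI library.

```shell
# "not a dynamic executable" from ldd means the binary is statically
# linked: its libraries were baked in at compile time. /bin/ls is a
# stand-in here for SOAPdenovo-63mer.
file /bin/ls   # e.g. "ELF 64-bit LSB executable, ... dynamically linked ..."
ldd /bin/ls    # lists the shared libraries a dynamic binary loads
# On a statically linked binary, ldd instead prints:
#   not a dynamic executable
```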
2) In short, does this mean I cannot solve the problem with the cluster and must look for a single machine with more than 512 GB of memory?
3) Also, I have tried another program, ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/), which also failed due to insufficient memory. According to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what does "it uses Shared Memory Parallelization" mean? Does it mean ALLPATHS-LG can use memory across the cluster, or only the memory within one node, so that I again have to find a machine with enough memory?
C1. Can I run ALLPATHS-LG on a cluster?
You can, but it will only use one machine, not the entire cluster. That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
By the way, this is my first post here; I suppose I should have replied with a comment, but given how much I had to say, I used "Answer Your Question" instead.