How to share memory across cluster machines (qsub, OpenMPI)

Dear all,

I have a question about sharing memory in a cluster. I am new to clusters, and after several weeks of trying I have failed to solve my problem, so I am asking for help here. Any advice would be greatly appreciated!

I want to use SOAPdenovo, a piece of software for assembling human genomes, to assemble my data. However, it failed at one step due to insufficient memory (my machine has 512 GB). So I turned to a cluster (it has three large nodes, each also with 512 GB of memory) and started learning to submit jobs with qsub. Since a single node cannot solve my problem, I googled around and found that OpenMPI might help, but when I ran OpenMPI with demo data, it seemed to just run the same command several times. I then found that to benefit from OpenMPI, the software must be built against the OpenMPI library, and I do not know whether SOAPdenovo supports OpenMPI. I have asked the author about this, but have not received an answer yet. Assuming SOAPdenovo supports OpenMPI, how should I solve my problem? And if it does not support OpenMPI, can I still use the memory of different nodes to run the software?

This problem has been tormenting me for a long time; thank you for any help. Below is what I did, along with some information about the cluster:

  • Installing OpenMPI and submitting the job

1) Job script:

#!/bin/bash
#
# SGE options: run in the submission directory (-cwd), merge stderr into
# stdout (-j y), and use bash as the job shell (-S /bin/bash).
#$ -cwd
#$ -j y
#$ -S /bin/bash
#

# Put the OpenMPI binaries and shared libraries first on the search paths
export PATH=/tools/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH

soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"

# Launch SOAPdenovo (all stages) under mpirun on every granted slot
/tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err

2) Submitting the job:

qsub -pe orte 60 mpi.qsub
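
To see how the 60 slots were distributed across the nodes while the job was running, the per-slot view of the job can be listed (a sketch, assuming standard SGE tooling; this was not part of my original session):

# one MASTER/SLAVE line per granted slot, with the queue instance (host) it landed on
qstat -g t -u $USER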

3) The log in ass.err

a) According to the log, SOAPdenovo seems to have been run many times:

cat ass.err | grep "Pregraph" | wc -l
60
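
I suspect this is simply what mpirun does with a program that never calls MPI: each of the 60 ranks runs its own independent copy of SOAPdenovo. A minimal sketch that reproduces the effect (not from my actual run):

# a non-MPI program launched under mpirun is replicated once per rank,
# so this prints "Pregraph" four times, matching the 60 copies above
mpirun -np 4 echo "Pregraph"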

b) Details:

less ass.err (it seems SOAPdenovo was started several times; when I run it on my own machine, it outputs only one Pregraph):


Version 2.04: released on July 13th, 2012
Compile Apr 27 2016     15:50:02

********************
Pregraph
********************

Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix 

In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.


Version 2.04: released on July 13th, 2012
Compile Apr 27 2016     15:50:02

********************
Pregraph
********************

and so on

c) The stdout output:

cat ass.log:

--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.

Host: smp03
PID:  75035

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
  • Information about the cluster:

1)qconf -sql

all.q
smp.q

2)qconf -spl

mpi
mpich
orte
zhongxm

3)qconf -sp zhongxm

pe_name            zhongxm
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

4)qconf -sq smp.q

qname                 smp.q
hostlist              @smp.q
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make zhongxm
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

5)qconf -sq all.q

qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make zhongxm
rerun                 FALSE
slots                 16,[c0219.local=32]
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            mobile
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

2 Answers

  • According to https://hpc.unt.edu/soapdenovo, the software does not support MPI:

    "This code is NOT compiled with MPI and can only be used in parallel on a SINGLE node, via the threads model."

    So you cannot simply launch the software with mpiexec on a cluster to get access to more memory. Cluster machines are connected by non-cache-coherent networks (Ethernet, InfiniBand) that are much slower than a memory bus, and the nodes of a cluster do not share memory. Clusters use an MPI library (OpenMPI or MPICH) to drive the network, and all communication between nodes is explicit: the program calls MPI_Send in one process and MPI_Recv in another. There are also one-sided calls such as MPI_Put/MPI_Get that access remote memory (RDMA, Remote Direct Memory Access), but that is still not the same as local memory.
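
    One way to check this for a given binary (a sketch added for illustration, not from the original answer; the path assumes the job script above): a dynamically linked MPI build would list libmpi among its dependencies, while a statically linked binary makes ldd print "not a dynamic executable", in which case the embedded symbol names can be searched instead:

    # dynamically linked: an MPI build would show libmpi among the dependencies
    ldd /tools/SOAPdenovo2/SOAPdenovo-63mer | grep -i mpi
    # statically linked ("not a dynamic executable"): look for MPI symbols directly
    strings /tools/SOAPdenovo2/SOAPdenovo-63mer | grep -i -m 1 mpi_init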

  • osgx, thank you very much for your reply, and sorry for the delay of this message.

    Since I do not have a computer science background, I find some of the terminology hard to understand, such as ELF. So I have a few new questions; I list them below, and thanks in advance for any help:

    1) When I run "ldd SOAPdenovo-63mer", it outputs "not a dynamic executable". Does this mean the code "is NOT compiled with MPI", as you mentioned?

    2) In short, is my problem unsolvable on the cluster, meaning I have to look for a single machine with more than 512 GB of memory?

    3) Also, another piece of software I used, ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/), failed due to insufficient memory. According to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), it "uses Shared Memory Parallelization". Does this mean it can use memory across the cluster, or only the memory within a single node, so that I have to find one machine with enough memory?

    C1. Can I run ALLPATHS-LG on a cluster?
    You can, but it will only use one machine, not the entire cluster.  That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
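
    In the meantime, I suppose I can at least check how much memory each node has with the standard SGE tools (a sketch, assuming a default installation):

    # qhost prints one line per execution host, including total and used memory
    qhost
    # or, logged in on a node, check its usable memory directly
    free -g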
    

    By the way, this is my first time posting here. I suppose I should have replied with a comment, but since there was so much to say, I used "answer your question" instead.
