Question

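I have thousands of DNA sequences, ranging from 100 to 5000 bp, and I need to align specified pairs and calculate their identity scores. Biopython's pairwise2 does a nice job, but only for short sequences: once the sequences get longer than about 2 kb it shows a severe memory leak that ends in a 'MemoryError', even with the 'score_only' and 'one_alignment_only' options!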

from Bio import pairwise2

whole_coding_scores = {}
for genes in whole_coding:  # whole_coding is a <25 MB dict providing DNA sequences
    seq_a, seq_b = whole_coding[genes][3], whole_coding[genes][4]
    # score_only=True returns just the alignment score, not the alignments
    score = pairwise2.align.globalxx(seq_a, seq_b, score_only=True, one_alignment_only=True)
    whole_coding_scores[genes] = score / min(len(seq_a), len(seq_b))

Result returned by the supercomputer:

Max vmem         = 256.114G  #Memory usage of the script
failed assumedly after job because:
job 4945543.1 died through signal XCPU (24)

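I know there are other alignment tools, but they mostly write the scores to an output file, which then has to be read and parsed again to retrieve and use the alignment scores. Is there any tool that aligns sequences and returns the alignment score inside Python, as pairwise2 does, but without the memory leak?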

Answer 1

First off, I used needle through BioPython for this. A nice howto (ignore the legacy design :-) ) can be found here

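Second: maybe you can avoid loading the whole set into memory by using a generator? I don't know where your 'whole_coding' object comes from. But if it is a file, make sure you don't read the whole file first and then iterate over the in-memory object. For example: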

whole_coding = open('big_file', 'rt').readlines() # Will consume memory

but

for gene in open('big_file', 'rt'):     # will not read the whole thing into memory first
    process(gene)

If you need some preprocessing, you can write a generator function:

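def gene_yielder(filename):
    with open(filename, 'rt') as handle:   # use the argument, not the literal 'filename'
        for line in handle:
            line = line.strip()   # Here you preprocess your data (keep the result)
            yield line            # Hands back one line at a time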

then

for gene in gene_yielder('big_file'):
    process_gene(gene)

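Basically, you want your program to act as a pipe: things flow through it and get processed. Don't use it like a cooking pot when making broth, where you add everything and then heat it. I hope this comparison is not too far-fetched :-)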
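Applied to the pairwise scoring in the question, the pipe idea might look like the sketch below. It assumes a hypothetical tab-separated file with one pair per line (gene id, first sequence, second sequence), and `score_pair` is only a placeholder for whatever aligner you actually call; neither comes from the question.

```python
import csv


def pair_yielder(filename):
    """Yield (gene_id, seq_a, seq_b) one pair at a time.

    The three-column tab-separated layout is an assumption of this sketch.
    """
    with open(filename, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 3:          # skip blank or malformed lines
                continue
            gene_id, seq_a, seq_b = row
            yield gene_id, seq_a.strip().upper(), seq_b.strip().upper()


def score_pair(seq_a, seq_b):
    """Hypothetical placeholder scorer: counts equal positions, no alignment."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def identity_scores(filename, scorer=score_pair):
    """Stream pairs through the scorer; only the result dict stays in memory."""
    scores = {}
    for gene_id, seq_a, seq_b in pair_yielder(filename):
        scores[gene_id] = scorer(seq_a, seq_b) / min(len(seq_a), len(seq_b))
    return scores
```

Swap `score_pair` for a real alignment call (for example a function wrapping pairwise2 with score_only=True) and the rest stays the same. Note that a pair containing an empty sequence would raise a ZeroDivisionError here.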

Answer 2

For global alignment, you could try NWalign https://pypi.python.org/pypi/nwalign/. I haven't used it, but it seems you can retrieve the alignment score in your script.

Otherwise, the EMBOSS tools might help: http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html
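If you go the EMBOSS route, the score does not have to take a detour through a file on disk: the report can be captured from standard output and parsed in the same script. A rough sketch, assuming EMBOSS is installed and `needle` is on the PATH; the `-stdout`, `-auto` and `asis::` usage and the gap values are my assumptions about a typical setup, so check them against your EMBOSS version.

```python
import re
import subprocess

# Matches the report header line, e.g. "# Identity:      12/20 (60.0%)"
IDENTITY_RE = re.compile(r"^# Identity:\s+(\d+)/(\d+)", re.MULTILINE)


def parse_identity(report):
    """Return (identical_positions, alignment_length) from a needle report."""
    match = IDENTITY_RE.search(report)
    if match is None:
        raise ValueError("no '# Identity:' line found in needle output")
    return int(match.group(1)), int(match.group(2))


def needle_identity(seq_a, seq_b, gapopen=10.0, gapextend=0.5):
    """Run needle on two raw sequences and parse the identity from stdout.

    Assumes the 'needle' executable is available; flags are assumptions.
    """
    cmd = [
        "needle",
        "-asequence", "asis::" + seq_a,
        "-bsequence", "asis::" + seq_b,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-stdout", "-auto",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_identity(result.stdout)
```

Keep in mind that the identity needle reports is relative to the alignment length, not to the length of the shorter sequence as in the question, so divide by whichever denominator you need.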

Answer 3

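Biopython can do it (now). The pairwise2 module in Biopython version 1.68 is faster and can handle longer sequences. Here is a comparison of the old and new pairwise2 (on 32-bit Python 2.7.11 with a 2 GB memory limit, 64-bit Win7, Intel Core i5, 2.8 GHz):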

from Bio import pairwise2

length_of_sequence = ...
seq1 = '...'[:length_of_sequence]  # Two very long DNA sequences, e.g. human and green monkey
seq2 = '...'[:length_of_sequence]  # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp)
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)

Old pairwise2
- max length/time: ~1,900 chars / 10 sec
New pairwise2
- max length/time: ~7,450 chars / 12 sec
- time for 1,900 chars: 1 sec

With score_only set to True, the new pairwise2 can align two sequences of ~8,400 chars in 6 sec.
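For intuition on why a score-only run can get by with far less memory than a full alignment: each cell of the dynamic-programming matrix depends only on the current and the previous row, so two rows are enough. The pure-Python sketch below uses the same scoring as globalxx (match = 1, no mismatch or gap penalties). It is only an illustration, not how pairwise2 is implemented, and it will be far slower than compiled code.

```python
def global_score_xx(seq_a, seq_b):
    """Global alignment score with match = 1, mismatch = 0, gaps = 0.

    Keeps only two rows of the DP matrix, so memory grows with the
    length of the shorter sequence rather than with len(a) * len(b).
    """
    if len(seq_b) > len(seq_a):
        seq_a, seq_b = seq_b, seq_a          # make seq_b the shorter one
    previous = [0] * (len(seq_b) + 1)        # row for the empty prefix of seq_a
    for char_a in seq_a:
        current = [0]                        # column for the empty prefix of seq_b
        for j, char_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (1 if char_a == char_b else 0)
            current.append(max(diagonal, previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

The identity from the question would then be `global_score_xx(a, b) / min(len(a), len(b))`.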

Aligning DNA sequences in Python
