Question

I need a way to build a consensus sequence from 3 to 1000 short (10-20 bp) nucleotide ("ATCG") reads of varying lengths.

A simplified example:

"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"

should yield the consensus sequence "AGGGGC".

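I found the module for multiple sequence alignment (MSA) in the BioPython library, but it only works for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment of two sequences of arbitrary length. I imagine there must be a library or implementation that combines these elements (MSA over sequences of unequal lengths), but after hours of searching the web and various documentation I have not found anything.

Any suggestions for existing modules/libraries (Python preferred) or programs that I could incorporate into a pipeline?

Thanks!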

Answer 1

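Add gap characters to the end of your sequences so that they all have the same length. You can then process them with a program of your choice, e.g. MUSCLE: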

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Applications import MuscleCommandline

sequences = ["AGGGGC",
             "AGGGC",
             "AGGGGGC",
             "AGGAGC",
             "AGGGGG"]

longest_length = max(len(s) for s in sequences)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences]
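# Give each record a unique id so the FASTA headers are distinct
records = (SeqRecord(Seq(s), id="read%d" % i, description="")
           for i, s in enumerate(padded_sequences, start=1))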

SeqIO.write(records, "msa_example.fasta", "fasta")

# Build the MUSCLE command line (MuscleCommandline is imported above)
cline = MuscleCommandline(input="msa_example.fasta", out="msa.txt")
print(cline)  # shows the command; MUSCLE itself is not run here
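
Once you have an alignment, one simple way to get a consensus is a per-column majority vote. The sketch below is only an illustration and not part of BioPython or MUSCLE: the function name `majority_consensus` and the hand-written `hypothetical_alignment` are assumptions of mine, and a real aligner may place the gaps differently. Columns where the gap character wins the vote are dropped from the result.

```python
from collections import Counter


def majority_consensus(aligned):
    """Return the per-column majority consensus of equal-length aligned rows.

    Columns whose most common symbol is the gap character '-' are skipped.
    Ties go to the symbol that appears first in the column.
    """
    if len({len(row) for row in aligned}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical alignment of the five reads from the question,
# written by hand for illustration (not actual MUSCLE output).
hypothetical_alignment = [
    "AGGGG-C",
    "AGGG--C",
    "AGGGGGC",
    "AGGAG-C",
    "AGGGGG-",
]

print(majority_consensus(hypothetical_alignment))  # prints AGGGGC
```

For real data you would read the rows from the aligner's output file instead of typing them in, and you may want a stricter rule than a plain majority (for example a minimum fraction of reads agreeing on a column).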

Multiple sequence alignment with strings of unequal length
