How can I use Biopython to translate a series of DNA sequences in a FASTA file and extract the protein sequences into a separate file?

Time: 2018-03-02 16:22:50

Tags: python parsing bioinformatics biopython fasta

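I'm new to Biopython (and to coding in general), and I'm trying to write something that translates a series of DNA sequences (more than 80) into protein sequences and writes them to a separate FASTA file. I also want each sequence to be translated in the correct reading frame.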

Here is what I have so far:

(code block missing from this copy of the question)

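The problem with my current code is that, although it seems to work, it only outputs the last sequence of the input file. Can anyone help me figure out how to write all of the sequences?

Thanks!

2 answers:

Answer 0 (score: 2)

As others have mentioned, your code iterates over the entire input before it tries to write any results. Here is how one might do this with a streaming approach instead: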

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("AAseq.fasta", 'w') as aa_fa:
    for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
        # use both fwd and rev sequences
        dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]

        # generate all translation frames
        aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)

        # select the longest one
        max_aa = max(aa_seqs, key=len)

        # write new record
        aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
        SeqIO.write(aa_record, aa_fa, 'fasta')

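The main improvements here are:

  1. A single record is translated and written on each iteration, which keeps memory usage to a minimum.
  2. Support for the reverse complement is added.
  3. The translated frames are created by a generator expression, and only the longest one is kept.
  4. The if...elif...else structure is avoided by using max with a key.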
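
To make points 3 and 4 concrete without needing Biopython installed, here is a small pure-Python sketch. The helper names `codons_until_stop` and `longest_frame` are made up for illustration, and the stop codon set assumes the standard genetic code; the first helper is only a toy stand-in for what `translate(to_stop=True)` does, since it returns codons rather than amino acids.

```python
# Toy stand-in for Seq.translate(to_stop=True): it collects codons instead of
# amino acids. Assumes the standard genetic code's stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_until_stop(dna, offset):
    """Read codons from `offset`, stopping before the first stop codon."""
    codons = []
    # len(dna) - 2 makes sure a trailing partial codon is never read
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons

def longest_frame(dna):
    """Return the longest of the three forward reading frames."""
    # generator expression: each frame is produced on demand
    frames = (codons_until_stop(dna, offset) for offset in range(3))
    # max keeps only the best candidate seen so far
    return max(frames, key=len)

best = longest_frame("ATGTAAGGGCCC")
```

Note that when two frames have the same length, max returns the first one it encounters, so the order in which the frames are generated decides ties.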

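Answer 1 (score: 1)

Your if is outside the for loop, so it is applied only once, using the variables with the values they had at the end of the last iteration of the loop. If you want the if to run on every iteration, you need to indent it to the same level as the code before it: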

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # Same indentation level, still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
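
The effect of the indentation can be seen in a stripped-down example that has nothing to do with sequences. The variable names here are invented for the illustration:

```python
lengths = [3, 7, 5]

# Version 1: the append is dedented, so it runs once, after the loop is done.
outside = []
for n in lengths:
    doubled = n * 2
outside.append(doubled)   # only sees the value from the last iteration

# Version 2: the append is indented, so it runs on every iteration.
inside = []
for n in lengths:
    doubled = n * 2
    inside.append(doubled)
```

After this runs, `outside` holds a single value while `inside` holds one value per input, which is the same symptom as getting only the last sequence in the output file.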

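Your function prot_record uses the current values of protein and protein_id, which again are whatever they were at the end of the last iteration of the for loop.

If I'm guessing correctly what you want, one possibility is to put the function declaration inside the loop, so that each function behaves according to the current iteration, and to save the functions in a list for later use when iterating over the records a second time. I'm not sure this works, though: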

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# List of functions:
record_makers = []
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
    # still in the loop; the default arguments bind the current values now,
    # otherwise every function would see the values from the last iteration
    def prot_record(record, protein=protein, protein_id=protein_id):
        # no ">" prefix on the id: the FASTA writer adds it itself
        return SeqRecord(seq = protein,
                         id = protein_id,
                         description = "translated sequence")
    record_makers.append(prot_record)

# zip the functions and the records together instead of
# mapping one single function to all the records
records = [record_maker(record) for (
    record_maker, record) in zip(
        record_makers, SeqIO.parse("dnaseq.fasta", "fasta"))]
SeqIO.write(records, "AAseq.fasta", "fasta")
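
One thing to watch for with functions defined inside a loop: Python looks up the names a function uses when the function is called, not when it is defined. The sketch below shows this with plain strings; the names are invented for the example. Binding the loop variable as a default argument is one common way to freeze the value for each function.

```python
values = ["a", "b", "c"]

# Each function reads v when it is called, i.e. after the loop has ended.
late = []
for v in values:
    def make_late():
        return v
    late.append(make_late)

# A default argument is evaluated at definition time, so each function
# keeps the v from its own iteration.
bound = []
for v in values:
    def make_bound(v=v):
        return v
    bound.append(make_bound)

late_results = [f() for f in late]
bound_results = [f() for f in bound]
```

Here `late_results` repeats the last value three times, while `bound_results` reproduces the original list.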

Another possible approach is to put the translation logic inside the function that builds the record:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def find_translation(record):
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
    return protein

def prot_record(record):
    protein = find_translation(record)
    # By the way: no need for backslashes here
    # (and no ">" prefix on the id: the FASTA writer adds it itself)
    return SeqRecord(seq = protein,
                     id = record.id,
                     description = "translated sequence")

# map() is lazy: each record is translated as SeqIO.write consumes it
records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")

This is probably cleaner. I haven't tested it.