Extracting terms from FASTA headers

Time: 2013-06-18 06:32:40

Tags: python biopython fasta

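I need to parse FASTA headers for the following terms: leaves, buds, stems and tender shoots. If a sequence's header contains any of these terms, I want to open a file and write the sequence to it using Biopython.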

So I used SeqIO.to_dict to convert the records to a dictionary:

from Bio import SeqIO
records_dict = SeqIO.to_dict(SeqIO.parse("my_example.fasta","fasta"))

But now I don't know how to get the terms from the headers. The sequences look like this:

>gi|393741877|gb|FS945568.1|FS945568 FS945568 tea plant lateral roots cDNA library Camellia sinensis cDNA clone LR29G09, mRNA sequence
CCGGGGATCCATTCCAAAATTCATCATAAACCTCTCAATATTGTTCACTTGAAAAAAGATGA...

>gi|393741878|gb|FS945569.1|FS945569 FS945569 tea plant lateral roots cDNA library Camellia sinensis cDNA clone LR29G11, mRNA sequence
CCGGGGGCTATCGAGCACTCACCGACTCACTCGAGAGCTAATACAGTCCACAGC...

>gi|393751846|gb|FS959695.1|FS959695 FS959695 tea plant young leaves cDNA library Camellia sinensis cDNA clone YL16A05, mRNA sequence
CCAACAACTTCTTCCTAACACTACCACCTTCTGTCAACTTACTTCTCCAAAGGCTTCTTTCTTCCACCAT
GGCTGCTTCTACCATGGCTCTCTCTTCCCCATCTTTCGCCGGAAAGGCGGTGAAACTTGCCCCGGAG...

>gi|393751847|gb|FS959696.1|FS959696 FS959696 tea plant young leaves cDNA library Camellia sinensis cDNA clone YL16A06, mRNA sequence
GAAACTGCATATAGAAAATCTCACTACCACTCTCTTCCTCTTCCTCTCTATCTTTCCTACCAAAGAAAG...

>gi|393750830|gb|FS956287.1|FS956287 FS956287 tea plant terminal buds cDNA library Camellia sinensis cDNA clone TB26G04, mRNA sequence
AGGATCGCACGGCCTTTGTGCCGGCGACGCATCATTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGA
TAGT...

>gi|393750831|gb|FS956288.1|FS956288 FS956288 tea plant terminal buds cDNA library Camellia sinensis cDNA clone TB26G05, mRNA sequence
TCCCACAAACATGTTGCTCTCATCTTTCCAGTAAAAGATAGAGAGAGAGAGAGAGAGAACAAAGCAG...

1 answer:

Answer 0: (score: 1)

Don't convert to a dictionary: what you need is the description of each defline, and to_dict() only uses the record id as the key.
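To make the difference concrete, here is a minimal sketch that parses one of the headers from the question. The sequence is shortened, and the in-memory StringIO handle and the name fasta_text are only there for illustration; in practice you would pass the file name as in the question.

```python
from io import StringIO
from Bio import SeqIO

# One record from the question; the sequence is shortened for illustration
fasta_text = (
    ">gi|393751846|gb|FS959695.1|FS959695 FS959695 tea plant young leaves "
    "cDNA library Camellia sinensis cDNA clone YL16A05, mRNA sequence\n"
    "CCAACAACTTCTTCC\n"
)

record = next(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(record.id)           # first whitespace-delimited token of the defline
print(record.description)  # the whole defline, without the leading ">"
print("leaves" in record.description)  # plain substring test on the description
```

The id contains only the database identifiers, so the tissue terms can only be found in record.description.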

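The description is just a string, so you can search it for the terms. Group the records by category (a record may belong to more than one category), then save each group to a file with SeqIO.write():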

from Bio import SeqIO

records = SeqIO.parse("my_example.fasta", "fasta")

terms = ["leaves", "buds", "stems", "tender shoots"]
categorized_records = {term: [] for term in terms}

for record in records:
    for term in terms:
        if term in record.description:
            categorized_records[term].append(record)

# Use a separate loop variable so the parser iterator above is not shadowed
for term, term_records in categorized_records.items():
    fasta_out = "%s.fasta" % term
    SeqIO.write(term_records, fasta_out, 'fasta')  # Will overwrite file
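Note that a plain `term in record.description` test is case-sensitive and also matches inside longer words. If that matters for your data, one possible refinement is sketched below using only the standard library. The helper names matching_terms and output_name are hypothetical, not part of Biopython, and the sample descriptions are made up for illustration.

```python
import re

terms = ["leaves", "buds", "stems", "tender shoots"]

# One case-insensitive, whole-word pattern per term
patterns = {
    term: re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    for term in terms
}

def matching_terms(description):
    """Return the terms that occur as whole words in a description string."""
    return [term for term, pattern in patterns.items() if pattern.search(description)]

def output_name(term):
    """Build an output file name without spaces (e.g. for 'tender shoots')."""
    return "%s.fasta" % term.replace(" ", "_")

print(matching_terms("FS959695 tea plant young Leaves cDNA library"))
print(matching_terms("hypothetical rosebuds library"))
print(output_name("tender shoots"))
```

In the loop above, you could then iterate over `matching_terms(record.description)` instead of testing every term with `in`, and pass `output_name(term)` to SeqIO.write().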