我有一个fasta文件(下面提到的第一个序列),描述很长。我需要选择具体的描述字段。当我使用下面的代码;整个描述进入字符串。
from Bio import SeqIO
for record in SeqIO.parse("geneTemp.fasta", "fasta") :
id=record.id
desc=record.description
print desc
有没有简单的方法可以将描述字段(使用biopython库)放入数组并选择特定字段而不将描述带入字符串并吐出字符串?
代码输出
Python 2.7 (r27:82500, Sep 16 2010, 18:03:06)
[GCC 4.5.1 20100907 (Red Hat 4.5.1-3)] on localhost.localdomain, Standard
>>> FBgn0197520 type=gene; loc=scaffold_12855:complement(6241650..6242111); ID=FBgn0197520; name=Dvir\GJ10233; dbxref=FlyBase_Annotation_IDs:GJ10233,FlyBase:FBgn0197520,GLEANR:dvir_GLEANR_10171,EntrezGene:6632532,GB_protein:EDW59542,FlyMine:FBgn0197520,OrthoDB4.Arthropods:FBgn0242841,OrthoDB4.Arthropods:FBgn0213090,OrthoDB4.Arthropods:FBgn0190974,OrthoDB4.Arthropods:FBgn0165423,OrthoDB4.Arthropods:FBgn0247590,OrthoDB4.Arthropods:FBgn0149779,OrthoDB4.Arthropods:FBgn0146205,OrthoDB4.Arthropods:FBgn0017456,OrthoDB4.Arthropods:FBgn0126736,OrthoDB4.Arthropods:FBgn0117264,OrthoDB4.Arthropods:FBgn0094317; MD5=0b7e859d2a6eca028ffd16b964835705; length=462; release=r1.2; species=Dvir;
loc=scaffold_12855:complement(6241650..6242111)
来自fasta文件的序列之一。
>FBgn0207418 type=gene; loc=scaffold_12875:complement(14361770..14363857); ID=FBgn0207418; name=Dvir\GJ20278; dbxref=FlyBase_Annotation_IDs:GJ20278,FlyBase:FBgn0207418,GLEANR:dvir_GLEANR_5721,EntrezGene:6625684,GB_protein:EDW61510,FlyMine:FBgn0207418,OrthoDB4.Arthropods:NV16422,OrthoDB4.Arthropods:LH16819,OrthoDB4.Arthropods:ISCW000548,OrthoDB4.Arthropods:FBgn0239668,OrthoDB4.Arthropods:FBgn0219970,OrthoDB4.Arthropods:FBgn0181866,OrthoDB4.Arthropods:FBgn0175499,OrthoDB4.Arthropods:FBgn0080765,OrthoDB4.Arthropods:FBgn0155230,OrthoDB4.Arthropods:FBgn0141947,OrthoDB4.Arthropods:FBgn0033392,OrthoDB4.Arthropods:FBgn0127494,OrthoDB4.Arthropods:FBgn0102879,OrthoDB4.Arthropods:FBgn0090125,OrthoDB4.Arthropods:CPIJ005729,OrthoDB4.Arthropods:GB15324,OrthoDB4.Arthropods:AGAP012336,OrthoDB4.Arthropods:AAEL007395,OrthoDB4.Arthropods:PB24927,OrthoDB4.Arthropods:PHUM365660,OrthoDB4.Arthropods:GLEAN_06039; MD5=4c62b751ec045ac93306ce7c08d254f9; length=2088; release=r1.2; species=Dvir;
ATGCGTCTGCGACGCCGCTGGCATCGGCGGATGCGGCGTACAATTGAGAA
AATCTATCGCCTTAAAATGCAATCGCGCCGCAAGTTGGTTTACTTAGCCG
TATTTGGAGCACTATGCGTAATATTCTGGCTGGCTGGACAGCAGTTGCTG
ACGACTTCGAATGGTCACTACAGTAGCTACTACGGCGAAACGCATTGTGC
GCCCATTGATGCCGTATACACCTGGGTAAATGGTTCGGATCCGGATTTTA
TTGAGTCCATTAGACGCTACGATGCCAGCTACGATCCGTCGCGCTTCGAC
答案 0 :(得分:5)
FASTA的描述部分不是标准的。您可以使用正则表达式来解析它。
>>> import re
>>> desc = 'FBgn0207418 type=gene; loc=scaffold_12875:complement(14361770..14363857); ...'
>>> fields = dict(re.findall(r'(\w+)=(.*?);', desc))
>>> fields['type']
'gene'
>>> fields['length']
'2088'