Parsing a huge FASTA file

Time: 2016-01-19 15:47:04

Tags: python fasta

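I have a huge FASTA file and I want to extract the sequences that belong to Homo sapiens. We could get the result with dictionaries, lists and similar approaches, but because of the file's size we cannot hold it in memory, so the results have to be written to a file. My sample FASTA file looks like this: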

  

gi|489223532|ref|WP_003131952.1| 30S ribosomal protein S18 [Lactococcus lactis] MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDLTRYYDG

     

gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Homo sapiens] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ

     

gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4] MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM

     

gi|446106212|ref|WP_000184067.1| MULTISPECIES: antibiotic transporter [Homo sapiens] MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR

     

gi|494110381|ref|WP_007051162.1| MULTISPECIES: argininosuccinate lyase [Bifidobacterium] MTENNEHLALWGGRFTSGPSPELARLSKSTQFDWRLADDDIAGSRAHARALGRAGLLTADELQRMEDALDTLQRHVDDGSFAPIEDDEDEATALERGLIDIAGDELGGKLRAGRSRNDQIACLIRMWLRRHSRVIAGLLLDLVNALIEQSEKAGRTVMPGRTHMQHAQPVLLAHQLMAHAWPLIRDVQRLIDWDKRINASPYGSGALAGNTLGLDPEAVARELGFIDGAD

Expected output:

  

gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Homo sapiens] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ

     

gi|446106212|ref|WP_000184067.1| MULTISPECIES: antibiotic transporter [Homo sapiens] MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR

2 answers:

Answer 0 (score: 1)

You should show some effort in your question, because you clearly have not tried anything. I am only answering because it takes 3 lines.

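# Iterate over the file line by line so it never has to fit in memory.
with open('/Users/nfirth/Downloads/file.fasta') as f:
    for line in f:
        if 'Homo sapiens' in line:
            print(line)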
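The question asks for the matches to be written to a file rather than printed. A minimal sketch of that, assuming each record (header plus sequence) sits on a single line as in the sample; the function name `copy_matching_lines` and its parameters are made up for illustration:

```python
def copy_matching_lines(src_path, dst_path, needle="Homo sapiens"):
    """Copy every line containing `needle` from src_path to dst_path.

    Hypothetical helper: it reads one line at a time, so memory use stays
    small regardless of file size. Returns the number of lines written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if needle in line:
                # Make sure each kept line ends with exactly one newline.
                dst.write(line.rstrip("\n") + "\n")
                written += 1
    return written
```

Called as `copy_matching_lines("file.fasta", "human.fasta")`, it would leave only the matching lines in `human.fasta`; both file names here are placeholders.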

Edit

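If the header line is followed by a line break, with the sequence on the lines after it, you need a clunkier piece of code, but it will still get through the file quickly.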

# Remember whether the current record is wanted; the flag is updated at
# every header line, so sequence lines simply follow their header.
keep = False
with open('/Users/nfirth/Downloads/file.fasta') as f:
    for line in f:
        if line.startswith('>'):
            # A new record starts here: keep it only if it is human.
            keep = 'Homo sapiens' in line
        if keep:
            print(line, end='')
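The same idea can be written with the records grouped explicitly, which avoids juggling a look-ahead line and lets the result go to a file as the question requires. This is a sketch rather than the answerer's code: `iter_fasta` and `write_matching` are hypothetical names, and it assumes every header line starts with `>`.

```python
def iter_fasta(lines):
    """Yield (header, sequence_lines) pairs from an iterable of lines.

    Only one record is held in memory at a time.
    """
    header, seq = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        # Emit the final record, which has no following header.
        yield header, seq


def write_matching(src_path, dst_path, species="Homo sapiens"):
    """Write records whose header mentions `species`; return how many."""
    count = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for header, seq in iter_fasta(src):
            if species in header:
                dst.write(header)
                dst.writelines(seq)
                count += 1
    return count
```

Separating the record reader from the filter keeps the end-of-file case in one place: the last record is emitted after the loop instead of relying on another header to follow it.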

Answer 1 (score: 0)

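from Bio import SeqIO

my_data = []
with open("test.fasta", "r") as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        # The species name is part of the full header text, which Biopython
        # stores in record.description; record.name only holds the text up
        # to the first whitespace.
        if 'Homo sapiens' in record.description:
            my_data.append(str(record.seq))

# Write each matching sequence followed by an end marker.
with open("output.fasta", "w") as out:
    for item in my_data:
        out.write("{0}\n===End===\n".format(item))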
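Because the file is too large to hold in memory, the matching records can also be streamed straight to the output instead of being collected in a list first. A sketch, assuming Biopython is installed and that the species name appears in the header text, which Biopython exposes as `record.description`; `filter_species` is a hypothetical name:

```python
from Bio import SeqIO


def filter_species(src_path, dst_path, species="Homo sapiens"):
    """Stream matching records from src_path to dst_path in FASTA format.

    Returns the number of records written.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        # The generator is consumed lazily by SeqIO.write, so only one
        # record needs to be in memory at a time.
        matches = (rec for rec in SeqIO.parse(src, "fasta")
                   if species in rec.description)
        return SeqIO.write(matches, dst, "fasta")
```

Unlike collecting `str(record.seq)` in a list, this keeps the headers, so the output is itself a FASTA file that other tools can read.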