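I have a huge FASTA file and I want to extract the sequences that belong to Homo sapiens. Normally we could use dictionaries and lists to get the result, but the file is too large to hold in memory, so the results have to be written to a file. My sample FASTA file looks like this: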
GI | 489223532 | REF | WP_003131952.1 | 30S ribosomal protein S18 [Lactococcus lactis] MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDLTRYYDG
GI | 66816243 | REF | XP_642131.1 | hypothetical protein DDB_G0277827 [Homo sapiens] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ
GI | 66818355 | REF | XP_642837.1 | hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4] MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
GI | 446106212 | REF | WP_000184067.1 | MULTISPECIES: antibiotic transporter [Homo sapiens] MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR
GI | 494110381 | REF | WP_007051162.1 | MULTISPECIES: argininosuccinate lyase [Bifidobacterium] MTENNEHLALWGGRFTSGPSPELARLSKSTQFDWRLADDDIAGSRAHARALGRAGLLTADELQRMEDALDTLQRHVDDGSFAPIEDDEDEATALERGLIDIAGDELGGKLRAGRSRNDQIACLIRMWLRRHSRVIAGLLLDLVNALIEQSEKAGRTVMPGRTHMQHAQPVLLAHQLMAHAWPLIRDVQRLIDWDKRINASPYGSGALAGNTLGLDPEAVARELGFIDGAD
Expected output:
GI | 66816243 | REF | XP_642131.1 | hypothetical protein DDB_G0277827 [Homo sapiens] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ
GI | 446106212 | REF | WP_000184067.1 | MULTISPECIES: antibiotic transporter [Homo sapiens] MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR
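Answer 0 (score: 1)
You should show some effort in your question, because you clearly haven't tried anything. I'm only answering because it takes 3 lines.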
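# Iterate over the file line by line so it never has to fit in memory.
with open('/Users/nfirth/Downloads/file.fasta') as f:
    for line in f:
        if 'Homo sapiens' in line:
            print(line, end='')  # the line already ends with a newline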
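Since the question asks for the results to go into a file rather than onto the screen, the same line filter can write its matches to an output file. This is only a minimal sketch, assuming each record sits on a single line as in the sample; the function name `filter_lines` and the file paths are hypothetical.

```python
def filter_lines(src_path, dst_path, needle='Homo sapiens'):
    """Copy every line containing `needle` from src_path to dst_path.

    Returns the number of lines written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:  # only one line is held in memory at a time
            if needle in line:
                dst.write(line)
                written += 1
    return written
```

It could be called as `filter_lines('file.fasta', 'homo_sapiens.fasta')`, where both file names are placeholders.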
Edit
If there is a newline after the header information, you need a clunkier piece of code, but it will still get through the file quickly.
# 'keep' records whether the record currently being read matched.
keep = False
with open('/Users/nfirth/Downloads/file.fasta') as f:
    for line in f:
        if line.startswith('>'):
            # Header line: decide whether to keep this whole record.
            keep = 'Homo sapiens' in line
        if keep:
            # Prints the header and every sequence line that follows it.
            print(line, end='')
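When records span several lines, one way to keep memory use flat is to group lines into records with a generator and write matching records straight to an output file. The sketch below assumes every header line starts with `>`; `read_fasta` and `filter_records` are hypothetical names, not part of any library.

```python
def read_fasta(handle):
    """Yield (header, sequence_lines) one record at a time from an open file."""
    header, seq_lines = None, []
    for line in handle:
        if line.startswith('>'):
            if header is not None:
                yield header, seq_lines  # emit the record just finished
            header, seq_lines = line, []
        elif header is not None:
            seq_lines.append(line)  # lines before the first header are skipped
    if header is not None:
        yield header, seq_lines  # emit the last record


def filter_records(src_path, dst_path, needle='Homo sapiens'):
    """Write records whose header contains `needle`; return how many were kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for header, seq_lines in read_fasta(src):
            if needle in header:
                dst.write(header)
                dst.writelines(seq_lines)
                kept += 1
    return kept
```

Only one record is held in memory at any moment, so the size of the input file should not matter as long as no single sequence is itself too large.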
Answer 1 (score: 0)
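from Bio import SeqIO

# Write each match as soon as it is found instead of collecting them in a list,
# so memory use does not grow with the size of the file.
with open("test.fasta", "r") as handle, open("output.fasta", "w") as out:
    for record in SeqIO.parse(handle, 'fasta'):
        # record.name holds only the first word of the header;
        # the species name appears in record.description.
        if 'Homo sapiens' in record.description:
            out.write("{0}\n===End===\n".format(record.seq))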