Extracting multiple sequences from a FASTA file, but with different names

Time: 2014-02-01 08:25:22

Tags: python biopython

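I am trying to extract a subset of sequences from a FASTA file based on a list of IDs, and so far that works. My problem is that my ID list contains an additional second column (the coding region of each sequence), which I would like to keep in the new FASTA file.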

File 1: Id list
>TCONS_00000004  654:819
>TCONS_00000006  238:367
>TCONS_00000009  956:1555

File 2: fasta file
>TCONS_00000004
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000006
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000009
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG

Expected outcome:
>TCONS_00000004 654:819
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000006 238:367
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000009 956:1555
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG

I tried the following Biopython code, but it only extracts the records from file 2, without the extra numbers I need.

from Bio import SeqIO
id = []
for line in open("test.txt","r"):
    id.append(line.rstrip().strip('\t'))
for rec in SeqIO.parse("mymodified_transcript.fa","fasta"):
    if rec.id in id:
        print rec.format("fasta")

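How can I keep the additional numbers while extracting the sequences from file 2? Or, alternatively, how can I replace the names in file 2 with the names from file 1? Thanks for your help.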

2 answers:

Answer 0 (score: 1)

I found a solution. It works fine on my Ubuntu machine. Please give it a try :)

from Bio import SeqIO

# Map each header from the ID list (including its leading '>') to its coding range.
temp = {}
for line in open("test.txt", "r"):
    i, c = line.strip().split()
    temp[i] = c

for rec in SeqIO.parse("mymodified_transcript.fa", "fasta"):
    key = '>' + rec.id  # rec.id has no '>', but the keys in temp do
    if key in temp:
        print(key, temp[key])
        print(str(rec.seq))
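
The two-value unpacking above raises a ValueError on blank lines or on lines without a second column. A minimal sketch of a more forgiving ID-list parser follows. The helper name `parse_id_list` is hypothetical, and it assumes the ID and the range are separated by whitespace, as in file 1. It also strips the leading '>' so the keys can be compared directly with `rec.id`.

```python
def parse_id_list(lines):
    """Return {id_without_'>': coding_range} from ID-list lines (hypothetical helper)."""
    ranges = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        seq_id = fields[0].lstrip(">")  # drop the FASTA-style '>' prefix
        # Use an empty string when the second column is missing.
        ranges[seq_id] = fields[1] if len(fields) > 1 else ""
    return ranges


# Sample lines in the format of file 1, including a blank line.
id_lines = [">TCONS_00000004  654:819\n", "\n", ">TCONS_00000006  238:367\n"]
ranges = parse_id_list(id_lines)
```

With a real file, the same function could be called as `parse_id_list(open("test.txt"))`, since a file object iterates over its lines.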

Answer 1 (score: 0)

Why not use a dictionary rather than a list for the ID lookup? For example:

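from Bio import SeqIO

# Map each ID (without the leading '>') to its coding range.
ranges = {}
for line in open("test.txt", "r"):
    i, c = line.strip().split()
    ranges[i.lstrip('>')] = c  # rec.id has no '>', so strip it from the key

for rec in SeqIO.parse("mymodified_transcript.fa", "fasta"):
    if rec.id in ranges:
        # The FASTA writer emits ">id description", so the range lands in the header.
        rec.description = ranges[rec.id]
        print(rec.format("fasta"), end="")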
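
For comparison, the same relabelling can be sketched without Biopython by streaming over the FASTA lines and rewriting only the headers. This is an illustration under stated assumptions, not a replacement for `SeqIO`: the function name `relabel_fasta` is hypothetical, and it assumes a plain FASTA layout in which every header line starts with '>' and the ID is the first whitespace-delimited token.

```python
def relabel_fasta(fasta_lines, ranges):
    """Yield the lines of the selected records, appending the range to each header.

    ranges maps an ID (without '>') to its coding range, e.g. "654:819".
    Hypothetical helper, not part of Biopython.
    """
    keep = False
    for line in fasta_lines:
        line = line.rstrip()  # drop the newline and any trailing spaces
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            keep = seq_id in ranges  # decides whether the following sequence lines are kept
            if keep:
                yield ">%s %s" % (seq_id, ranges[seq_id])
        elif keep and line:
            yield line


# Small made-up input: three records, of which two are listed in ranges.
fasta = [
    ">TCONS_00000004\n", "AAAATAATAAAC \n", "CGCGATTATATT\n",
    ">TCONS_00000005\n", "GGGG\n",
    ">TCONS_00000006\n", "TTTT\n",
]
ranges = {"TCONS_00000004": "654:819", "TCONS_00000006": "238:367"}
result = list(relabel_fasta(fasta, ranges))
```

Because it is a generator, it can be fed an open file and written out line by line, so large FASTA files do not have to be held in memory.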