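I'm trying to extract a subset of sequences from a FASTA file based on a list of IDs, and so far that part works. My problem is that my ID list contains an extra second column (representing the coding portion of each sequence), and I want to keep it in the new FASTA file.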
File 1: Id list
>TCONS_00000004 654:819
>TCONS_00000006 238:367
>TCONS_00000009 956:1555
File 2: fasta file
>TCONS_00000004
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000006
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000009
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
Expected outcome:
>TCONS_00000004 654:819
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000006 238:367
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000009 956:1555
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
I tried the following Biopython code, but it only extracts the records from file 2, without the extra numbers I need.
from Bio import SeqIO

id = []
for line in open("test.txt", "r"):
    id.append(line.rstrip().strip('\t'))

for rec in SeqIO.parse("mymodified_transcript.fa", "fasta"):
    if rec.id in id:
        print(rec.format("fasta"))
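How can I keep the extra numbers while extracting the sequences from file 2? Or, alternatively, replace the names in file 2 with the names from file 1? Thanks for your help.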
Answer 0: (score: 1)
I found a solution. It works fine on my Ubuntu machine. Please try this :)
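from Bio import SeqIO

# Map each header from the ID list (including its leading '>') to its coordinates.
temp = {}
with open("test.txt") as handle:
    for line in handle:
        i, c = line.strip().split()
        temp[i] = c

for rec in SeqIO.parse("mymodified_transcript.fa", "fasta"):
    key = '>' + rec.id
    if key in temp:
        print(key, temp[key])  # header with the coordinates appended
        print(str(rec.seq))    # whole sequence on a single line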
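One thing to be aware of: printing str(rec.seq) puts each sequence on a single line, so the line wrapping shown in the expected outcome is lost. If that matters, here is a rough sketch that avoids Biopython and just streams the lines through, rewriting the headers and leaving the sequence lines as they are. The function name annotate_fasta and the shortened toy sequences are made up for illustration, and it assumes the first whitespace-separated token of each header is the ID.

```python
def annotate_fasta(id_lines, fasta_lines):
    """Return FASTA lines for the listed records, with coordinates added to headers."""
    # Map bare ID (no leading '>') -> coordinate string such as "654:819".
    coords = {}
    for line in id_lines:
        fields = line.split()
        if len(fields) >= 2:
            coords[fields[0].lstrip(">")] = fields[1]

    out = []
    keep = False  # True while we are inside a record whose ID is in the list
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            rec_id = fields[0] if fields else ""
            keep = rec_id in coords
            if keep:
                out.append(">{} {}".format(rec_id, coords[rec_id]))
        elif keep and line:
            out.append(line)  # sequence line kept exactly as wrapped in the input
    return out


# Toy data: IDs taken from the question, sequences shortened for illustration.
ids = [">TCONS_00000004 654:819", ">TCONS_00000009 956:1555"]
fasta = [
    ">TCONS_00000004", "AAAATAAT", "CGCGATTA",
    ">TCONS_00000006", "AAAATAAT",
    ">TCONS_00000009", "GGGG",
]
for out_line in annotate_fasta(ids, fasta):
    print(out_line)
```

Both arguments can be any iterable of lines, so open file handles for test.txt and mymodified_transcript.fa should work in place of the lists.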
Answer 1: (score: 0)
Why not use a dictionary for the ID lookup instead of a list? For example:
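from Bio import SeqIO

# Map each bare ID (leading '>' stripped so it matches rec.id) to its coordinates.
coords = {}
with open("test.txt") as handle:
    for line in handle:
        i, c = line.strip().split()
        coords[i.lstrip('>')] = c

for rec in SeqIO.parse("mymodified_transcript.fa", "fasta"):
    if rec.id in coords:
        # The header becomes ">ID coordinates" when the record is formatted.
        rec.description = rec.id + " " + coords[rec.id]
        print(rec.format("fasta"), end="")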
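The dictionary lookup only works if the keys look exactly like rec.id, which has no leading '>'. Also, unpacking line.split() into two names raises a ValueError on a blank line or a line with no second column. A slightly more forgiving loader might look like the sketch below; parse_id_list is a hypothetical helper name, and storing an empty string for a missing second column is just one possible choice.

```python
def parse_id_list(lines):
    """Build a dict mapping bare sequence IDs to their coordinate strings."""
    coords = {}
    for line in lines:
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue  # skip blank lines instead of failing
        rec_id = fields[0].lstrip(">")  # match Biopython's rec.id, which has no '>'
        # Assumption: a missing second column is stored as an empty string.
        coords[rec_id] = fields[1] if len(fields) > 1 else ""
    return coords


coords = parse_id_list([">TCONS_00000004 654:819", "", ">TCONS_00000006\t238:367"])
print("TCONS_00000004" in coords)
print(coords["TCONS_00000006"])
```

Membership tests on a dict are hash lookups, so this also scales better than scanning a list when the ID file is large.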