Using Biopython, how do I extract my genes of interest from a FASTA file when the gene names are stored in a text file?
# extract genes
from Bio import SeqIO

f1 = open('ortholog1.txt', 'r')
f2 = open('all.fasta', 'r')
f3 = open('ortholog1.fasta', 'w')
genes = [line.rstrip('\n') for line in f1.readlines()]
i = 0
for seq_record in SeqIO.parse(f2, "fasta"):
    if genes[i] == seq_record.id:
        print(genes[i])
        f3.write('>' + genes[i])
        i = i + 1
        if i == 18:
            break
        f3.write('\n')
        f3.write(str(seq_record.seq))
        f3.write('\n')
f2.close()
f3.close()
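I am trying the code above, but it has some errors and it is not generic: besides ortholog1.txt (which contains the gene names) there are 5 more similar files, and the number of genes differs from file to file (it is not always 18, as it is here). all.fasta is the file that contains all the genes. ortholog1.fasta must contain the extracted nucleotide sequences.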
Answer 0 (score: 1)
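Basically, you can let Biopython do all the work.
I am assuming that the gene names in "ortholog1.txt" are exactly the same as those in the FASTA file, with one gene name per line. If not, you will need to adjust them so that they match.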
from Bio import SeqIO

# Read the gene names, one per line
with open('ortholog1.txt', 'r') as f:
    orthologs_txt = f.read()
orthologs = orthologs_txt.splitlines()

# record.description is the whole header line after '>';
# compare against record.id instead to match only its first word
genes_to_keep = []
for record in SeqIO.parse('all.fasta', 'fasta'):
    if record.description in orthologs:
        genes_to_keep.append(record)

with open('ortholog1.fasta', 'w') as f:
    SeqIO.write(genes_to_keep, f, 'fasta')
Edit: here is one way to keep the output genes in the same order as in the orthologs file:
from Bio import SeqIO

# Build a dictionary of all records, keyed by record id
with open('all.fasta', 'r') as fasta_file:
    record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file, 'fasta'))

with open('ortholog1.txt', 'r') as text_file:
    orthologs_txt = text_file.read()

# Look the names up in file order; names missing from the FASTA file are skipped
genes_to_keep = []
for ortholog in orthologs_txt.splitlines():
    try:
        genes_to_keep.append(record_dict[ortholog])
    except KeyError:
        pass

with open('ortholog1.fasta', 'w') as output_file:
    SeqIO.write(genes_to_keep, output_file, 'fasta')
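Since the question mentions several name files with different gene counts, the same lookup can be wrapped in a function and run once per file. The following is only a sketch under assumptions: the helper names (read_gene_names, select_in_order, extract_all) are made up for illustration, each output file is assumed to take the name of its .txt file with a .fasta suffix, and SeqIO.index is swapped in for SeqIO.to_dict so that not all sequences are held in memory at once.

```python
from pathlib import Path


def read_gene_names(names_path):
    """Return the non-empty, stripped lines of a gene-name file, in order."""
    with open(names_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def select_in_order(record_by_id, names):
    """Pick the records for `names`, keeping the order of `names`.

    Names that are not keys of `record_by_id` are returned separately so
    the caller can report them instead of silently dropping them.
    """
    found = [record_by_id[name] for name in names if name in record_by_id]
    missing = [name for name in names if name not in record_by_id]
    return found, missing


def extract_all(fasta_path, names_paths):
    """Write one FASTA file per gene-name file (hypothetical helper)."""
    # Imported here so the two helpers above stay usable without Biopython.
    from Bio import SeqIO

    # SeqIO.index gives dictionary-style access keyed by record.id and
    # parses each record only when it is requested.
    record_by_id = SeqIO.index(str(fasta_path), "fasta")
    try:
        for names_path in names_paths:
            names = read_gene_names(names_path)
            found, missing = select_in_order(record_by_id, names)
            # Assumed naming scheme: ortholog1.txt -> ortholog1.fasta
            out_path = Path(names_path).with_suffix(".fasta")
            SeqIO.write(found, str(out_path), "fasta")
            if missing:
                print(f"{names_path}: not found in FASTA: {missing}")
    finally:
        record_by_id.close()
```

If the other files follow the pattern ortholog2.txt to ortholog6.txt (an assumption, the question does not give their names), a call could look like extract_all("all.fasta", [f"ortholog{n}.txt" for n in range(1, 7)]). No gene count is hard-coded, so each file can list any number of genes.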
Answer 1 (score: 1)
I am not a Biopython expert, so I will leave the details of the output to you, but the data flow can be very simple (the code is commented):
from Bio import SeqIO

# Initially we have a dictionary where the keys are gene names and the
# values are all empty lists
genes = {gene.strip(): [] for gene in open('ortholog1.txt', 'r')}

# parse the data
for record in SeqIO.parse(open('all.fasta', 'r'), "fasta"):
    # append the current record if its id is in the keys of genes
    if record.id in genes:
        genes[record.id].append(record)

with open('ortholog1.fasta', 'w') as fout:
    # we read again, line by line, the file of genes
    for gene in open('ortholog1.txt', 'r'):
        # if the list associated with the current gene is not empty
        if genes[gene.strip()]:
            # output the list of records for the current gene using
            # biopython facilities
            ...
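The output step is left open above. One possible way to finish it, offered only as a sketch and not as this answer's own method, is to flatten the per-gene lists in the order of the name file and hand the result to SeqIO.write. The function names below (group_records, flatten_in_order, write_selected) are hypothetical; the grouping helpers only assume that each record has an id attribute.

```python
def group_records(records, wanted_names):
    """Map each wanted gene name to the list of records carrying that id."""
    groups = {name: [] for name in wanted_names}
    for record in records:
        if record.id in groups:
            groups[record.id].append(record)
    return groups


def flatten_in_order(groups, wanted_names):
    """Concatenate the per-gene lists, following the order of `wanted_names`."""
    ordered = []
    for name in wanted_names:
        ordered.extend(groups.get(name, []))
    return ordered


def write_selected(fasta_path, names_path, out_path):
    """Hypothetical end-to-end helper built on the two functions above."""
    # Imported lazily; only this function needs Biopython.
    from Bio import SeqIO

    with open(names_path) as handle:
        wanted = [line.strip() for line in handle if line.strip()]
    groups = group_records(SeqIO.parse(fasta_path, "fasta"), wanted)
    # SeqIO.write takes an iterable of SeqRecord objects and returns
    # the number of records it wrote.
    return SeqIO.write(flatten_in_order(groups, wanted), out_path, "fasta")
```

Genes with no matching record simply contribute an empty list, so they are skipped without a special case. If a name appears twice in the text file, its records are written twice.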