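I am trying to retrieve specific RNA UTR sequences from a database. The database gives me a .dat file in which each RNA UTR is represented by an entry like this: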
ID   5MMUR018955; SV 1; linear; mRNA; STD; MUS; 54 BP.
XX
AC   BR058092;
XX
DT   01-JUL-2009 (Rel. 9, Created)
DT   01-JUL-2009 (Rel. 9, Last updated, Version 1)
XX
DE   5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA.
XX
DR   REFSEQ; NM_008694;
DR   UTRef; CR062409;
DR   GeneID; 18054;
XX
OS   Mus musculus (house mouse)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea;
OC   Muridae; Murinae; Mus; Mus.
XX
UT   5'UTR; 1 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..54
FT                   /organism="Mus musculus"
FT                   /mol_type="mRNA"
FT                   /strain="C57BL/6"
FT                   /db_xref="taxon:10090"
FT   5'UTR           1..54
FT                   /source="REFSEQ::NM_008694:1..54"
FT                   /gene="Ngp"
FT                   /product="neutrophilic granule protein"
FT                   /gene_synonym="bectenecin"
FT                   /genome="chr9:110322312-110322365:+"
XX
SQ   Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other;
     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca              54
//
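I have a list of gene names (each stored in a line such as FT /gene="Ngp"), and for each name I want to retrieve the sequence stored in the lines
SQ   Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other;
     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca              54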
Once retrieved, I want to write both the gene name and the sequence to a new file in FASTA format, i.e.
>Ngp
agtctcaatatcatctacataaaaggggccaagagtggtagtgtgtcagagaca
Is there a simple way to do this in Python? I have been struggling with it all day without getting anywhere, so I would really appreciate your help.
Answer 0 (score: 1)
Answer 1 (score: 0)
I use Biopython to parse the EMBL file and extract the information:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

input_path = "test.embl"  # change this to your input file

# Take the first record; this is enough if the input file holds one sequence
record = next(SeqIO.parse(input_path, "embl"))

# Collect the 5'UTR features (this entry has only one)
utr5 = [feature for feature in record.features if feature.type == "5'UTR"]
genes = utr5[0].qualifiers["gene"]  # you get ['Ngp']

# Create a SeqRecord whose id is the gene name;
# you may remove the description if it is not required
new_record = SeqRecord(record.seq, id="_".join(genes),
                       name=record.name, description=record.description)
print(new_record.format("fasta"))
You get:
>Ngp 5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA.
AGTCTCAATATCATCTACATAAAAGGGGCCAAGAGTGGTAGTGTGTCAGAGACA
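If Biopython is not available, the same extraction can be approximated with the standard library alone. The following is only a rough sketch, not a full EMBL parser: the function names `parse_embl_utrs` and `write_fasta` are made up for this illustration, the code assumes the file has its normal line breaks, and it simply takes the first /gene qualifier it finds in each entry, whichever feature that qualifier belongs to.

```python
import re
import sys


def parse_embl_utrs(text):
    """Yield (gene, sequence) pairs from EMBL-style flat-file text.

    Hypothetical helper: it takes the first /gene qualifier of each entry
    and does not check which feature the qualifier belongs to.
    """
    gene = None
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: emit what was collected, then reset the state
            if gene is not None and seq_parts:
                yield gene, "".join(seq_parts)
            gene, seq_parts, in_sequence = None, [], False
        elif line.startswith("SQ"):
            in_sequence = True
        elif in_sequence:
            # Sequence lines hold blocks of bases plus a running count;
            # keep only the letters
            seq_parts.append(re.sub(r"[^a-zA-Z]", "", line))
        elif line.startswith("FT") and gene is None:
            # Matches /gene="..." but not /gene_synonym="..."
            match = re.search(r'/gene="([^"]+)"', line)
            if match:
                gene = match.group(1)


def write_fasta(pairs, handle):
    """Write (gene, sequence) pairs to an open file handle as FASTA."""
    for gene, sequence in pairs:
        handle.write(f">{gene}\n{sequence}\n")


# A shortened copy of the entry from the question
entry = """ID   5MMUR018955; SV 1; linear; mRNA; STD; MUS; 54 BP.
FT   5'UTR           1..54
FT                   /gene="Ngp"
FT                   /gene_synonym="bectenecin"
SQ   Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other;
     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca              54
//
"""

wanted = {"Ngp"}  # the gene names you are looking for
hits = [(g, s) for g, s in parse_embl_utrs(entry) if g in wanted]
write_fasta(hits, sys.stdout)
```

To write a file instead of printing, pass an open handle, for example `with open("utrs.fasta", "w") as out: write_fasta(hits, out)`. If you do have Biopython records, `SeqIO.write(records, "utrs.fasta", "fasta")` does the writing for you.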
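Answer 2 (score: 0)
Here is a more robust solution that searches the database file for a list of genes, prints the matches in FASTA format, and lists the genes that were not found.
Note that the database may contain several sequence records with the same gene name, so you may need additional filtering to get exactly the sequences you want.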
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

data = "embl.dat"          # path to the EMBL database file
search = "gene_names.txt"  # path to the file with search terms, one per line

# Load the search terms from file and strip the linefeed characters
with open(search) as handle:
    search_genes = handle.read().splitlines()
found_genes = set()

# Search the EMBL database file
for record in SeqIO.parse(data, "embl"):
    utr5 = [feature for feature in record.features if feature.type == "5'UTR"]
    for utr5_feature in utr5:
        # .get avoids a KeyError for 5'UTR features without a /gene qualifier
        genes = utr5_feature.qualifiers.get("gene", [])
        for s in search_genes:
            if s in genes:
                found_genes.add(s)
                # Gene found: print a modified copy of the record as FASTA
                print(SeqRecord(record.seq, id="_".join(genes), name=record.name,
                                description=record.description).format("fasta"))

# List any search terms that were not found in the database
for s in search_genes:
    if s not in found_genes:
        print(s + " NOT FOUND IN DATABASE!")
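As for the additional filtering mentioned in the note about duplicate gene names, one possible approach is to keep a single sequence per gene. The sketch below keeps the longest one, which is only an assumption for illustration; you may prefer a different criterion, such as a particular RefSeq accession. The function name `longest_per_gene` and the toy sequences are made up.

```python
def longest_per_gene(pairs):
    """Keep one sequence per gene name, preferring the longest.

    'Longest wins' is an arbitrary rule chosen for this sketch.
    """
    best = {}
    for gene, sequence in pairs:
        if gene not in best or len(sequence) > len(best[gene]):
            best[gene] = sequence
    return best


# Toy (gene, sequence) pairs, as they might be collected in the search loop
pairs = [("Ngp", "agtc"), ("Actb", "ggcc"), ("Ngp", "agtctcaata")]

unique = longest_per_gene(pairs)
for gene, sequence in unique.items():
    print(f">{gene}\n{sequence}")
```

In the search loop you would collect the pairs in a list instead of printing each hit right away, then filter and print at the end.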