I have a sequence that has homologs in several species, together with a score for each homolog.
Here is an example record from the GFF file:
4592637 Beutenbergia_cavernae_DSM_12333 TILL 70731 70780 . 0 . clst_id=429;SubjectOrganism=Thermofilum_pendens_Hrk_5;SubjectScore=0.343373493975904;SubjectOrganism=Ignicoccus_hospitalis_KIN4_I;SubjectScore=0.323293172690763;SubjectOrganism=Burkholderia_pseudomallei_MSHR346;SubjectScore=0.343373493975904;SubjectOrganism=Burkholderia_mallei_SAVP1;SubjectScore=0.343373493975904;SubjectOrganism=Enterobacter_638;SubjectScore=0.343373493975904;SubjectOrganism=Rickettsia_felis_URRWXCal2;SubjectScore=0.343373493975904;SubjectOrganism=Gemmatimonas_aurantiaca_T_27;SubjectScore=0.343373493975904;SubjectOrganism=Streptomyces_coelicolor;SubjectScore=0.363453815261044;SubjectOrganism=Beutenbergia_cavernae_DSM_12333;SubjectScore=1;SubjectOrganism=Kocuria_rhizophila_DC2201;SubjectScore=0.343373493975904;SubjectOrganism=Rhodococcus_jostii_RHA1;SubjectScore=0.383534136546185;SubjectOrganism=Symbiobacterium_thermophilum_IAM14863;SubjectScore=0.363453815261044;
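==> 4592637 => sequence ID in NAPP (Nucleic Acid Phylogenetic Profile database); not a GenBank ID
==> Beutenbergia_cavernae_DSM_12333 => name of the species the sequence comes from
==> TILL => sequence type
==> 70731 .. 70780 => start and end of the sequence
==> clst_id=429 => ID of the cluster this sequence belongs to
==> SubjectOrganism => name of a species in which the sequence has a homolog
==> SubjectScore => score of the homolog found in that species (Blastn score)
I want to extract, from each SubjectOrganism, the sequence that is similar to sequence 4592637.
How can I use Python to extract these sequences from the genomes in which the sequence has homologs?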
Answer 0: (score: 0)
You can simply treat the sequence as a string and slice it as needed. For example:
>>> s="abcdefghij"
>>> len(s)
10
>>> s[5:10]
'fghij'
>>>
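Treat s as the complete genome string and replace 5:10 with the coordinates from the GFF record. GFF coordinates are 1-based and inclusive, while Python slices are 0-based and exclude the end index, so 70731..70780 becomes s[70730:70780]. Hope that helps!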
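Because GFF positions count from 1 and include the end position, a small helper can take care of the conversion to a Python slice. This is a minimal sketch: gff_slice is a hypothetical name, and it assumes the genome is already loaded as a plain Python string and that the feature lies on the forward strand.

```python
def gff_slice(genome, start, end):
    """Return the subsequence for 1-based, inclusive GFF coordinates."""
    if start < 1 or end < start or end > len(genome):
        raise ValueError("coordinates out of range")
    # GFF start is 1-based, Python is 0-based: shift start by one.
    # GFF end is inclusive, Python's end is exclusive: keep end as is.
    return genome[start - 1:end]


s = "abcdefghij"
print(gff_slice(s, 6, 10))  # same letters as s[5:10]
```

For the record in the question the call would be gff_slice(genome, 70731, 70780), which returns 50 bases.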
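Answer 1: (score: 0)
From another question, I think you have already figured this out. If that is the case, StackOverflow encourages you to answer your own questions: post the answer and accept it! Anyway:
First, fetch the query sequence, replacing id with the ID of your organism. I found this one by searching NCBI for "Beutenbergia cavernae DSM 12333":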
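from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "you@example.com"
seq = Entrez.efetch(db="nuccore",
                    id="229564415",
                    rettype="fasta",
                    retmode="text",
                    seq_start=70731,
                    seq_stop=70780).readlines()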
Now seq contains something like:
['>gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333,'
'complete genome\n',
'GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n',
'\n']
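Since seq is a list of lines, the header and the bare sequence still have to be separated if you want to inspect the sequence itself. A minimal sketch, assuming a single FASTA record; split_fasta_lines is a hypothetical helper name, not part of Biopython:

```python
def split_fasta_lines(lines):
    """Split the lines of one FASTA record into (header, sequence)."""
    header = ""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines such as the trailing '\n'
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
        else:
            parts.append(line)
    return header, "".join(parts)


example = [
    ">gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333, complete genome\n",
    "GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n",
    "\n",
]
header, sequence = split_fasta_lines(example)
print(header)
print(sequence)
```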
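Start a qblast with this sequence, as shown in the other question, but replace the hard-coded entrez_query with the organism string from the GFF file (the GFF names use underscores, which you may need to replace with spaces for the Entrez filter to match):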
from Bio.Blast.NCBIWWW import qblast
results = qblast("blastn",
                 "nr",
                 "".join(seq),
                 entrez_query='Thermofilum_pendens_Hrk_5')
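The organism strings can be pulled out of the last GFF column with plain string handling. The sketch below assumes the attribute layout shown in the question, where each SubjectOrganism is immediately followed by its SubjectScore; parse_subjects is a hypothetical helper name.

```python
def parse_subjects(attributes):
    """Return a list of (organism, score) pairs from a GFF attribute string."""
    subjects = []
    organism = None
    for field in attributes.split(";"):
        if "=" not in field:
            continue  # skip empty pieces, e.g. after the trailing ';'
        key, value = field.split("=", 1)
        key = key.strip()
        if key == "SubjectOrganism":
            organism = value.strip()
        elif key == "SubjectScore" and organism is not None:
            # Pair the score with the organism that preceded it.
            subjects.append((organism, float(value)))
            organism = None
    return subjects


attrs = ("clst_id=429;"
         "SubjectOrganism=Thermofilum_pendens_Hrk_5;SubjectScore=0.343373493975904;"
         "SubjectOrganism=Beutenbergia_cavernae_DSM_12333;SubjectScore=1;")
for organism, score in parse_subjects(attrs):
    print(organism, score)
```

Each organism name can then be used for entrez_query in turn; the entry for the query's own organism (score 1) can be skipped.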
Be careful: if you send thousands of queries like this, NCBI will almost certainly block you from its queue.
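One way to stay polite is to run the searches one at a time with a pause in between. This is only a sketch: run_throttled and the search callable are hypothetical names, and the 10-second default is an assumption, so check NCBI's current usage guidelines for the actual limits.

```python
import time


def run_throttled(organisms, search, delay=10.0):
    """Call search(organism) for each organism, sleeping between calls."""
    results = {}
    for i, organism in enumerate(organisms):
        if i > 0:
            time.sleep(delay)  # pause before every call except the first
        results[organism] = search(organism)
    return results


# Stand-in search function so the sketch runs offline; in real use it
# would wrap the qblast call with entrez_query set to the organism.
demo = run_throttled(["Thermofilum_pendens_Hrk_5", "Enterobacter_638"],
                     search=lambda name: name.replace("_", " "),
                     delay=0)
print(demo)
```

Passing the search step in as a function keeps the pacing logic separate from the network call, so it can be tried out without contacting NCBI.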