我正在尝试使用NCBIWWW获得特定蛋白质的blastP结果。问题是发回的东西不是我认为的对齐数据,我得到的就是这个(这是源代码中'Blast_record'的内容); 我正在使用从'BioPython教程和食谱'获得的代码,我已经搜索它和互联网的错误来源,但我只是找不到它。我的源代码是这样的;
# biopython
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
# first get the sequence we want to parse from a FASTA file
# f_record = next(SeqIO.parse('m_cold.fasta', 'fasta'))
print('Doing the BLAST and retrieving the results...')
result_handle = NCBIWWW.qblast('blastp', 'tsa', '365176198')
# save the results for later, in case we want to look at it
save_file = open('m_cold_blast.out', 'w')
blast_results = result_handle.read()
save_file.write(blast_results)
save_file.close()
生成的文件是:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.31+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>tsa</BlastOutput_db>
<BlastOutput_query-ID>gi|365176198|gb|AEW67975.1|</BlastOutput_query-ID>
<BlastOutput_query-def>polyprotein [Black queen cell virus] </BlastOutput_query-def>
<BlastOutput_query-len>171</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|365176198|gb|AEW67975.1|</Iteration_query-ID>
<Iteration_query-def>polyprotein [Black queen cell virus]</Iteration_query-def>
<Iteration_query-len>171</Iteration_query-len>
<Iteration_hits>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
现在,如果我使用BlastN和上面使用的蛋白质的核苷酸序列进行搜索,我会得到所有匹配的序列,它们的E值和分数等。那么为什么使用BlastP时不是这样呢?
我对Python和Biopython都很陌生,对于我的生活,我无法弄清楚我做错了什么。
答案 0 :(得分:1)
QUERY 365176198是一种蛋白质
DATABASE 是有核心的
什么是Transcriptome Shotgun Assembly(TSA)数据库?
TSA是来自原始数据(如EST,痕迹和下一代测序技术)的计算组装序列的存档。通过计算方法将完整转录组的重叠序列读数组装成转录本,而不是通过克隆cDNA的传统克隆和测序。
BLA搜索是否可以获得TSA序列?
Transcriptome Shotgun Assembly(TSA)BLAST数据库现已推出。序列最初包含在nt中,但现在已被分离到单独的数据库中。 TSA数据库可从BLAST主页的Basic BLAST下的核苷酸,tblastn和tblastx链接获得。这些序列在nt。
中不可用BLAST FLAVORS
blastp: compares an amino acid query sequence against a protein sequence database blastn: compares a nucleotide query sequence against a nucleotide sequence database blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.
手动答案
BIOPYTHON ANSWER
你必须使用&#34; tsa_nt&#34;而不是&#34; tsa&#34;,&#34; tblastn&#34;而不是&#34; blastp&#34;
query = '365176198'
#note: this may take several minutes
result_handle = NCBIWWW.qblast('tblastn', 'tsa_nt', query, format_type="Text")
你得到:
....... Score E Sequences producing significant alignments: (Bits) Value gb|GAZV01037943.1| Apis mellifera comp13466_c0_seq1 transcrib... 342 4e-105 gb|GAZF01116856.1| Essigella californica C629542 transcribed ... 179 1e-54 gb|GBYB01008381.1| Fopius arisanus c20283_g1_i1 transcribed R... 149 2e-37 gb|GAUO01000423.1| Velia caprai s423_L_1942_0 transcribed RNA... 58.9 3e-07 gb|GAXG01028220.1| Gynaikothrips ficorum s28263_L_292921_0 tr... 57.0 9e-07 gb|GAWP01023404.1| Grylloblatta bifratrilecta s23438_L_295244... 52.8 4e-05 gb|GAWZ01143177.1| Gryllotalpa sp. AD-2013 C589197 transcribe... 45.4 0.002 gb|GAXW01013938.1| Euroleon nostras s13984_L_116369_0 transcr... 45.8 0.006 gb|GAXC01050700.1| Thrips palmi C235436 transcribed RNA sequence 42.0 0.017 gb|GAXH01037906.1| Parides eurimedes C235744 transcribed RNA ... 40.4 0.069 gb|GBES01007135.1| Dichelops melacanthus Locus_17334_Transcri... 39.7 0.18 gb|GBXI01014067.1| Bactrocera cucurbitae c16593_g1_i1 transcr... 40.4 0.49 gb|GAMC01001920.1| Ceratitis capitata comp55379_c0_seq1 mRNA ... 40.4 0.49 gb|GARL01030594.1| Spodoptera exigua SEUC25635_TC01 transcrib... 39.7 0.88 gb|GAZS01034153.1| Acanthoscurria geniculata L2169_T1/2_Turan... 38.5 2.1 gb|GAZS01034154.1| Acanthoscurria geniculata L2170_T2/2_Turan... 38.5 2.1 gb|GAYD01030921.1| Blaberus atropos s30958_L_499964_0 transcr... 38.5 2.2 gb|GAZR01021123.1| Stegodyphus mimosarum L19863_T1/1_Velvet_W... 36.2 3.0 gb|GBCX01022664.1| Dastarcus helophoroides Unigene14575 trans... 37.7 3.8 gb|GAYF01148415.1| Nilaparvata lugens C730037 transcribed RNA... 37.7 4.5 gb|GAMK01054259.1| Phaseolus vulgaris Ref_259_comp8866_c0_seq... 37.0 8.5 gb|EZ343106.1| Artemisia annua strain Uganda Contig10322.Uhm ... 35.8 8.9 ALIGNMENTS >gb|GAZV01037943.1| TSA: Apis mellifera comp13466_c0_seq1 transcribed RNA sequence Length=6998 Score = 342 bits (878), Expect = 4e-105, Method: Compositional matrix adjust. Identities = 164/168 (98%), Positives = 167/168 (99%), Gaps = 0/168 (0%) Frame = +2 Query 4 YALYRGGVRVKVVTGRGVDFVRATVSPQQTYGSEVAPTTHISTPLAIEQIPIKGVAEFQI 63 YALYRGGVRVKVVT +GVDFVRATVSPQQTYGS+VAPTTHISTPLAIEQIPIKGVAEFQI Sbjct 6317 YALYRGGVRVKVVTEKGVDFVRATVSPQQTYGSDVAPTTHISTPLAIEQIPIKGVAEFQI 6496 Query 64 PYYAPCLSSSFRANSETFYYSSGRNNLDIATSPPSINRYYAVGAGDDMDFSIFIGTPPCI 123 PYYAPCLSSSFRANSETFYYSSGRNNLDI+TSPPSINRYYAVGAGDDMDFSIFIGTPPCI Sbjct 6497 PYYAPCLSSSFRANSETFYYSSGRNNLDISTSPPSINRYYAVGAGDDMDFSIFIGTPPCI 6676 .......
答案 1 :(得分:0)
我注意到你的输出文件以'.out'结尾。尝试将其保存到XML文件,并将它们映射到漂亮的列。在输出文件的第一行,你会看到'?xml。 qBLAST函数默认使用XML,并且“Text”也有一个可选参数,尽管格式化是一场噩梦。
blastp和tsa也是不同的数据库。 qblast模块有一些内置的帮助,可以帮助使用不同的参数,可以使用它来访问。
>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
(我会运行你的代码,但qBLAST在撰写本文时遇到了问题)