BlastP output from NCBIWWW is not what I expected

Time: 2015-05-26 11:19:42

Tags: biopython blast

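I am trying to get BLASTP results for a particular protein using NCBIWWW. The problem is that what comes back is not the alignment data I expected; what I get instead is the file shown further down (this is the content of 'blast_results' in the source code). I am using code taken from the Biopython Tutorial and Cookbook, and I have searched both the tutorial and the internet for the source of the error, but I just cannot find it. My source code is: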

# biopython
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

# first get the sequence we want to parse from a FASTA file
# f_record = next(SeqIO.parse('m_cold.fasta', 'fasta'))

print('Doing the BLAST and retrieving the results...')
result_handle = NCBIWWW.qblast('blastp', 'tsa', '365176198')

# save the results for later, in case we want to look at them
with open('m_cold_blast.out', 'w') as save_file:
    blast_results = result_handle.read()
    save_file.write(blast_results)

The resulting file is:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.31+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>tsa</BlastOutput_db>
<BlastOutput_query-ID>gi|365176198|gb|AEW67975.1|</BlastOutput_query-ID>
<BlastOutput_query-def>polyprotein [Black queen cell virus]    </BlastOutput_query-def>
<BlastOutput_query-len>171</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
  <Parameters_matrix>BLOSUM62</Parameters_matrix>
  <Parameters_expect>10</Parameters_expect>
  <Parameters_gap-open>11</Parameters_gap-open>
  <Parameters_gap-extend>1</Parameters_gap-extend>
  <Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|365176198|gb|AEW67975.1|</Iteration_query-ID>
<Iteration_query-def>polyprotein [Black queen cell virus]</Iteration_query-def>
<Iteration_query-len>171</Iteration_query-len>
<Iteration_hits>
</Iteration_hits>
<Iteration_stat>
<Statistics>
  <Statistics_db-num>0</Statistics_db-num>
  <Statistics_db-len>0</Statistics_db-len>
  <Statistics_hsp-len>0</Statistics_hsp-len>
  <Statistics_eff-space>0</Statistics_eff-space>
  <Statistics_kappa>-1</Statistics_kappa>
  <Statistics_lambda>-1</Statistics_lambda>
  <Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

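Now, if I run the search with BLASTN using the nucleotide sequence of the protein above, I get all the matching sequences with their E-values, scores and so on. So why is that not the case with BLASTP?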

I am new to both Python and Biopython, and for the life of me I cannot figure out what I am doing wrong.

2 answers:

Answer 0 (score: 1)

The QUERY, 365176198, is a protein.

The DATABASE is a nucleotide database.

What is the Transcriptome Shotgun Assembly (TSA) database?

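TSA is an archive of computationally assembled sequences from primary data such as ESTs, traces and next-generation sequencing technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods, rather than by the traditional cloning and sequencing of cloned cDNAs.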

Are TSA sequences available for BLAST searches?

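The Transcriptome Shotgun Assembly (TSA) BLAST database is now available. The sequences were initially included in nt but have since been separated into their own database. The TSA database is available from the nucleotide, tblastn and tblastx links under Basic BLAST on the BLAST home page. These sequences are not available in nt.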

BLAST FLAVORS

blastp: 
     compares an amino acid query sequence against a protein sequence
     database

blastn: compares a nucleotide query sequence against a nucleotide 
     sequence database

blastx: compares a nucleotide query sequence translated in all 
     reading frames against a protein sequence database

tblastn: compares a protein query sequence against a nucleotide 
     sequence database dynamically translated in all reading frames

tblastx: compares the six-frame translations of a nucleotide query
     sequence against the six-frame translations of a nucleotide 
     sequence database. Please note that tblastx program cannot be 
     used with the nr database on the BLAST Web page.
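
To make the mismatch concrete, here is a minimal sketch that encodes the list above as a lookup table. The names BLAST_PROGRAMS and pick_program are hypothetical helpers written for this illustration; they are not part of Biopython or of any NCBI tool.

```python
# Query and database molecule types expected by each BLAST program,
# transcribed from the list above. This table is illustrative only.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def pick_program(query_type, db_type):
    """Return the BLAST programs that accept this query/database pairing."""
    return [
        name
        for name, pair in BLAST_PROGRAMS.items()
        if pair == (query_type, db_type)
    ]


# A protein query against a nucleotide database such as TSA:
print(pick_program("protein", "nucleotide"))
```

For a protein query against a nucleotide database the only match is tblastn, which is why blastp against TSA comes back empty.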

MANUAL ANSWER

input blast manually

output blast manually

BIOPYTHON ANSWER

You have to use "tsa_nt" instead of "tsa", and "tblastn" instead of "blastp":

from Bio.Blast import NCBIWWW

query = '365176198'
# note: this may take several minutes
result_handle = NCBIWWW.qblast('tblastn', 'tsa_nt', query, format_type="Text")
print(result_handle.read())

You get:

.......

                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

gb|GAZV01037943.1|  Apis mellifera comp13466_c0_seq1 transcrib...  342     4e-105
gb|GAZF01116856.1|  Essigella californica C629542 transcribed ...  179     1e-54 
gb|GBYB01008381.1|  Fopius arisanus c20283_g1_i1 transcribed R...  149     2e-37 
gb|GAUO01000423.1|  Velia caprai s423_L_1942_0 transcribed RNA...  58.9    3e-07 
gb|GAXG01028220.1|  Gynaikothrips ficorum s28263_L_292921_0 tr...  57.0    9e-07 
gb|GAWP01023404.1|  Grylloblatta bifratrilecta s23438_L_295244...  52.8    4e-05 
gb|GAWZ01143177.1|  Gryllotalpa sp. AD-2013 C589197 transcribe...  45.4    0.002 
gb|GAXW01013938.1|  Euroleon nostras s13984_L_116369_0 transcr...  45.8    0.006 
gb|GAXC01050700.1|  Thrips palmi C235436 transcribed RNA sequence  42.0    0.017 
gb|GAXH01037906.1|  Parides eurimedes C235744 transcribed RNA ...  40.4    0.069 
gb|GBES01007135.1|  Dichelops melacanthus Locus_17334_Transcri...  39.7    0.18  
gb|GBXI01014067.1|  Bactrocera cucurbitae c16593_g1_i1 transcr...  40.4    0.49  
gb|GAMC01001920.1|  Ceratitis capitata comp55379_c0_seq1 mRNA ...  40.4    0.49  
gb|GARL01030594.1|  Spodoptera exigua SEUC25635_TC01 transcrib...  39.7    0.88  
gb|GAZS01034153.1|  Acanthoscurria geniculata L2169_T1/2_Turan...  38.5    2.1   
gb|GAZS01034154.1|  Acanthoscurria geniculata L2170_T2/2_Turan...  38.5    2.1   
gb|GAYD01030921.1|  Blaberus atropos s30958_L_499964_0 transcr...  38.5    2.2   
gb|GAZR01021123.1|  Stegodyphus mimosarum L19863_T1/1_Velvet_W...  36.2    3.0   
gb|GBCX01022664.1|  Dastarcus helophoroides Unigene14575 trans...  37.7    3.8   
gb|GAYF01148415.1|  Nilaparvata lugens C730037 transcribed RNA...  37.7    4.5   
gb|GAMK01054259.1|  Phaseolus vulgaris Ref_259_comp8866_c0_seq...  37.0    8.5   
gb|EZ343106.1|  Artemisia annua strain Uganda Contig10322.Uhm ...  35.8    8.9   

ALIGNMENTS
>gb|GAZV01037943.1| TSA: Apis mellifera comp13466_c0_seq1 transcribed RNA sequence
Length=6998

 Score = 342 bits (878),  Expect = 4e-105, Method: Compositional matrix adjust.
 Identities = 164/168 (98%), Positives = 167/168 (99%), Gaps = 0/168 (0%)
 Frame = +2

Query  4     YALYRGGVRVKVVTGRGVDFVRATVSPQQTYGSEVAPTTHISTPLAIEQIPIKGVAEFQI  63
             YALYRGGVRVKVVT +GVDFVRATVSPQQTYGS+VAPTTHISTPLAIEQIPIKGVAEFQI
Sbjct  6317  YALYRGGVRVKVVTEKGVDFVRATVSPQQTYGSDVAPTTHISTPLAIEQIPIKGVAEFQI  6496

Query  64    PYYAPCLSSSFRANSETFYYSSGRNNLDIATSPPSINRYYAVGAGDDMDFSIFIGTPPCI  123
             PYYAPCLSSSFRANSETFYYSSGRNNLDI+TSPPSINRYYAVGAGDDMDFSIFIGTPPCI
Sbjct  6497  PYYAPCLSSSFRANSETFYYSSGRNNLDISTSPPSINRYYAVGAGDDMDFSIFIGTPPCI  6676

.......

Answer 1 (score: 0)

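I noticed that your output file ends in '.out'. Try saving it as an XML file and mapping the fields into neat columns. On the first line of the output file you will see '?xml'. The qblast function returns XML by default; "Text" is also available as an optional format, although parsing it is a nightmare.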
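
As a rough sketch of reading that XML, the snippet below uses only the standard library to pull out the program, the database, the number of hits and the number of database sequences searched. SAMPLE_XML is a trimmed stand-in for the saved file (the real output has more fields), and summarize is a hypothetical helper, not a Biopython function.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the saved BLAST XML; the real file has more fields.
SAMPLE_XML = """<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_db>tsa</BlastOutput_db>
<BlastOutput_iterations>
<Iteration>
<Iteration_hits>
</Iteration_hits>
<Iteration_stat>
<Statistics>
  <Statistics_db-num>0</Statistics_db-num>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>"""


def summarize(xml_text):
    """Return (program, database, number of hits, sequences searched)."""
    root = ET.fromstring(xml_text)
    program = root.findtext("BlastOutput_program")
    database = root.findtext("BlastOutput_db")
    # Each reported match is a <Hit> element inside <Iteration_hits>.
    hit_count = len(list(root.iter("Hit")))
    # Number of database sequences the search actually ran against.
    db_num = int(next(root.iter("Statistics_db-num")).text)
    return program, database, hit_count, db_num


print(summarize(SAMPLE_XML))
```

Zero hits together with zero sequences searched suggests the search never ran against a usable database, rather than the query simply having no matches.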

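Also note that blastp is a program while tsa is a database, and the two have to be compatible. qblast has built-in help describing its parameters, which you can access like this: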

>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)

(I would have run your code, but qBLAST was having problems at the time of writing.)