Problem parsing BLAST XML output

Time: 2014-11-03 17:39:42

Tags: python xml parsing blast

I am having trouble parsing some BLAST XML output with the following Python script:

#!/usr/bin/env python

import sys
from Bio.Blast import NCBIXML
# Usage: opens an outfile, then parses any number of .xml files into it, printing all hits.
# parse_blastn.py outfile.txt anynumberofinfiles.xml
OUT = open(sys.argv[1], 'w')

OUT.write("Query Name\tQuery Length\tSubject Name\tSubject Length\tAlignment Length\tQuery Start\tQuery End\tSubject Start\tSubject End\tQuery Sequence\tSubject Sequence\tHsp Score\tHsp Expect\tHsp Identities\tPercent Match\tNumber_of_gaps")
for xml_file in sys.argv[2:]:
    result_handle = open(xml_file)
    blast_records = NCBIXML.parse(result_handle)
    for rec in blast_records:
        for alignment in rec.alignments:
            for hsp in alignment.hsps:
                # Every field is wrapped in str(); the identities/length ratio is a float and cannot be concatenated directly.
                OUT.write('\n' + str(rec.query) + '\t' + str(rec.query_length) + '\t' + str(alignment.title) + '\t' + str(alignment.length) + '\t' + str(hsp.align_length) + '\t' + str(hsp.query_start) + '\t' + str(hsp.query_end) + '\t' + str(hsp.sbjct_start) + '\t' + str(hsp.sbjct_end) + '\t' + str(hsp.query) + '\t' + str(hsp.sbjct) + '\t' + str(hsp.score) + '\t' + str(hsp.expect) + '\t' + str(hsp.identities) + '\t' + str(float(hsp.identities)/int(hsp.align_length)) + '\t' + str(hsp.gaps))
    result_handle.close()
OUT.close()
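Concatenating sixteen fields by hand makes it easy to mix strings and numbers. As a minimal sketch (not part of the script above), the same row could be built as a list and written with the standard `csv` module, which converts each field to text itself. The names `HEADER`, `hsp_row`, and `write_hits` are hypothetical helpers introduced only for this illustration; the attribute names are the ones the script already uses.

```python
import csv

# Column names copied from the header line of the script.
HEADER = [
    "Query Name", "Query Length", "Subject Name", "Subject Length",
    "Alignment Length", "Query Start", "Query End", "Subject Start",
    "Subject End", "Query Sequence", "Subject Sequence", "Hsp Score",
    "Hsp Expect", "Hsp Identities", "Percent Match", "Number_of_gaps",
]


def hsp_row(rec, alignment, hsp):
    """Build one output row (hypothetical helper) for a single HSP."""
    # Same ratio as the script: identities divided by alignment length.
    match = float(hsp.identities) / hsp.align_length
    return [
        rec.query, rec.query_length, alignment.title, alignment.length,
        hsp.align_length, hsp.query_start, hsp.query_end,
        hsp.sbjct_start, hsp.sbjct_end, hsp.query, hsp.sbjct,
        hsp.score, hsp.expect, hsp.identities, match, hsp.gaps,
    ]


def write_hits(records, out_handle, header=True):
    """Write every HSP of every record as one tab-separated line."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(HEADER)
    for rec in records:
        for alignment in rec.alignments:
            for hsp in alignment.hsps:
                writer.writerow(hsp_row(rec, alignment, hsp))
```

With Biopython this would presumably be called as `write_hits(NCBIXML.parse(result_handle), OUT)` for the first input file and with `header=False` for the remaining ones, so the header is written only once.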

The error I get when I try to run this script is:

Traceback (most recent call last):
  File "./parse_blast.py", line 13, in <module>
    for rec in blast_records:
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 637, in parse
    expat_parser.Parse(text, False)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 5
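The error refers to line 1, column 5 of the input, so the XML parser is rejecting something at the very start of one of the files. One possibility, which this traceback alone does not confirm, is that an input file is not XML at all (BLAST+ writes XML only when run with `-outfmt 5`). A minimal sketch for inspecting the inputs follows; `peek_first_line` and `starts_with_xml_declaration` are hypothetical helper names, not Biopython functions.

```python
import codecs


def peek_first_line(path, limit=200):
    """Return up to `limit` raw bytes from the first line of `path`."""
    with open(path, "rb") as handle:
        return handle.readline(limit)


def starts_with_xml_declaration(path):
    """Return True if the file begins with an XML declaration ('<?xml')."""
    head = peek_first_line(path)
    # Ignore a UTF-8 byte-order mark if the file has one.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    return head.lstrip().startswith(b"<?xml")
```

Running both helpers over each path in `sys.argv[2:]` and printing `repr(peek_first_line(path))` shows exactly which characters sit around column 5, which should narrow down whether the file is plain-text BLAST output, an empty or truncated file, or XML with a stray character.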

Does anyone know what I can do to fix the problem or the script?

Thanks

0 answers:

No answers