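I am trying to parse the following XML file with lxml.etree.parse(). On line 3 the XML contains a character from a special font. The character comes from the Dingbats font, has the value 0x07, and is rendered as a telephone pictogram. Notepad++ displays it as BEL (white letters inside a black rectangle). I was unable to paste the character itself into this question.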
<!DOCTYPE qgis PUBLIC 'http://mrcc.com/qgis.dtd' 'SYSTEM'>
<layer pass="0" class="FontMarker" locked="0">
<prop k="chr" v="!!!SPECIAL_CARACTER_HERE!!!"/>
</layer>
</qgis>
This character makes lxml fail (the standard xml module fails too) with the following error:
File "lxml.etree.pyx", line 3193, in lxml.etree.parse (src/lxml/lxml.etree.c:64168)
File "parser.pxi", line 1548, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:91390)
File "parser.pxi", line 1577, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:91674)
File "parser.pxi", line 1477, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:90741)
File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:87655)
File "parser.pxi", line 565, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:83243)
File "parser.pxi", line 656, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:84225)
File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83549)
lxml.etree.XMLSyntaxError: invalid character in attribute value, line 3, column 14
How can I parse a document like this?
Answer 0 (score: 0)
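It seems lxml cannot cope with this character. However, you can use the recover option of the parser to get past the error.
recover - try hard to parse through broken XML
Note that in the session below the offending character is dropped, so the v attribute comes back empty.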
>>> from lxml import etree
>>> parser = etree.XMLParser(recover=True)
>>> tree = etree.parse("/tmp/qgis.xml", parser=parser)
>>> tree.find("layer/prop").attrib
{'v': '', 'k': 'chr'}
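
If silently recovering is not desirable, another option is to remove the offending bytes yourself before parsing. XML 1.0 does not allow control characters below 0x20 other than tab, line feed and carriage return, which is why BEL (0x07) is rejected. The following is only a minimal sketch under some assumptions: the file is in an ASCII-compatible encoding such as UTF-8 or Latin-1 (it would not be safe for UTF-16), the helper name strip_invalid_xml_bytes is made up for this example, and the raw bytes are a stand-in for the real file contents. It uses the standard library parser to stay self-contained; the cleaned bytes could be passed to lxml.etree.fromstring() in the same way.

```python
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Matching on bytes is only safe for ASCII-compatible encodings.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_bytes(data: bytes, replacement: bytes = b"") -> bytes:
    """Hypothetical helper: replace XML 1.0-illegal control bytes in data."""
    return _INVALID_XML_BYTES.sub(replacement, data)


# Stand-in for the file contents, with BEL (0x07) as the attribute value.
raw = (
    b'<qgis><layer pass="0" class="FontMarker" locked="0">'
    b'<prop k="chr" v="\x07"/></layer></qgis>'
)

cleaned = strip_invalid_xml_bytes(raw)
root = ET.fromstring(cleaned)  # no longer raises a parse error
prop = root.find("layer/prop")
print(prop.attrib)
```

As with recover=True, the character is lost from the attribute value. If you need to know where it was, pass a visible marker as the replacement argument (for example b"[BEL]") and map it back after parsing.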