lxml crashes when parsing a character from a special font

Time: 2016-05-19 07:52:12

Tags: python xml parsing fonts lxml

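I am trying to parse the following XML file with lxml.etree.parse(). The XML contains a character from a special font on line 3. The character comes from the Dingbats font, value 0x07, the telephone pictogram. In Notepad++ it is displayed as BEL, white letters inside a black rectangle. I was unable to include the character itself in the question.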

<!DOCTYPE qgis PUBLIC 'http://mrcc.com/qgis.dtd' 'SYSTEM'>
      <layer pass="0" class="FontMarker" locked="0">
      <prop k="chr" v="!!!SPECIAL_CARACTER_HERE!!!"/>
      </layer>
</qgis>

This character makes lxml (and xml) fail with the following error:

  File "lxml.etree.pyx", line 3193, in lxml.etree.parse (src/lxml/lxml.etree.c:64168)
  File "parser.pxi", line 1548, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:91390)
  File "parser.pxi", line 1577, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:91674)
  File "parser.pxi", line 1477, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:90741)
  File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:87655)
  File "parser.pxi", line 565, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:83243)
  File "parser.pxi", line 656, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:84225)
  File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83549)
lxml.etree.XMLSyntaxError: invalid character in attribute value, line 3, column 14

How can I parse a document like this?
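
One way to confirm which byte the parser objects to is to scan the raw file for control characters that XML 1.0 does not allow. The following is a minimal sketch, assuming the file is UTF-8 or another ASCII-compatible encoding. The function name and the sample document are made up for illustration and are not the real QGIS file.

```python
import re

# Control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_xml_bytes(data: bytes):
    """Yield (line, column, byte value) for each forbidden control byte."""
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for match in _ILLEGAL.finditer(line):
            # Columns are 1-based and counted in bytes, not characters.
            yield lineno, match.start() + 1, match.group()[0]


# Hypothetical stand-in for the real file: a BEL byte (0x07) in an attribute.
sample = b'<qgis>\n  <layer>\n    <prop k="chr" v="\x07"/>\n  </layer>\n</qgis>\n'

hits = list(find_illegal_xml_bytes(sample))
for lineno, column, value in hits:
    print(f"line {lineno}, column {column}: byte 0x{value:02x}")
```

For a real file, the bytes would come from reading it in binary mode instead of from the sample. Because columns are counted in bytes here, they may not match the column the parser reports.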

Update: a link to the file itself.

1 Answer:

Answer 0: (score: 0)

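It seems lxml cannot cope with this character. The byte 0x07 is a control character that XML 1.0 does not permit, so the parser rejects the document as invalid. However, you can use the recover option to get past the error.

recover - try hard to parse through broken XML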

>>> from lxml import etree
>>> parser = etree.XMLParser(recover=True)
>>> tree = etree.parse("/tmp/qgis.xml", parser=parser)
>>> tree.find("layer/prop").attrib
{'v': '', 'k': 'chr'}
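
Note that with recover=True the value of v comes back empty, so the offending character is silently dropped. If you would rather control what happens to it, a different approach is to remove or replace the forbidden control bytes before parsing. The following is a minimal sketch, not part of lxml: the helper name and the sample document are hypothetical, and it assumes a UTF-8 or other ASCII-compatible file. It uses the standard library parser to stay self-contained; the cleaned bytes could equally be handed to lxml.etree.fromstring().

```python
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_bytes(data: bytes, replacement: bytes = b"") -> bytes:
    """Return data with every forbidden control byte replaced."""
    return _ILLEGAL_XML_BYTES.sub(replacement, data)


# Hypothetical stand-in for the real file: a BEL byte (0x07) in an attribute.
sample = b'<qgis><layer pass="0"><prop k="chr" v="\x07"/></layer></qgis>'

# Replace the forbidden byte with a visible placeholder, then parse normally.
root = ET.fromstring(strip_illegal_xml_bytes(sample, b"?"))
prop = root.find("layer/prop")
print(prop.attrib)
```

Either way the original character is lost, since it cannot be represented in XML 1.0. The placeholder only makes the loss visible instead of leaving an empty attribute.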