How to read an XML file that contains an & character

Time: 2016-05-03 13:38:22

Tags: python parsing lxml

This is my XML file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<papers>
  <paper>
    <title>Title containing & and more</title>
  </paper>
</papers>

How can I read it with lxml's etree? I tried:

from lxml import etree

with open(xml_file, 'r') as inf:
    tree = etree.parse(inf)

But it produces the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 5, column 30

2 answers:

Answer 0 (score: 6)

If you need to keep the & character, you can parse the file as HTML.

from lxml import html
tree = html.parse(path)

If you don't need the & character, you can create a new XML parser and pass the recover=True option.

from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse(path, parser=parser)

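Answer 1 (score: 3)

The XML file is malformed because of the bare ampersand: in XML, & starts an entity reference, so a literal ampersand in text must be written as &amp;. If you can, use BeautifulSoup, which is a more fault-tolerant parser.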

from bs4 import BeautifulSoup

# data holds the contents of the XML file as a string
soup = BeautifulSoup(data, "html.parser")
print(soup.find("title").text)

Output:

Title containing & and more
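
Neither answer shows it, but since the only problem is the unescaped ampersand, another option may be to escape it before handing the text to a strict XML parser. The following is a minimal sketch under assumptions, not something taken from the answers: the helper name escape_bare_ampersands and the regular expression are hypothetical, and the sketch uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages.

```python
import re
import xml.etree.ElementTree as ET

# Matches an "&" that does not start one of the five predefined XML
# entities or a numeric character reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Hypothetical helper: replace each bare "&" with "&amp;"."""
    return BARE_AMP.sub("&amp;", text)


# The sample document from the question, including the bare "&".
raw = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<papers>
  <paper>
    <title>Title containing & and more</title>
  </paper>
</papers>
"""

# After escaping, the text is well-formed and a strict parser accepts it.
root = ET.fromstring(escape_bare_ampersands(raw))
title = root.find("paper/title").text
print(title)
```

This keeps the & in the parsed text and leaves the document structure unchanged. The pattern is deliberately simple: it would also rewrite an ampersand inside a CDATA section or a reference to a custom entity declared in a DTD, so it is only appropriate when the input is known not to contain those.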