Text is not processed by lxml

Time: 2015-07-03 15:59:29

Tags: python lxml

My HTML file contains the following line:

<tr><td>&nbsp;</td><tr>

But when I parse it with lxml:

from lxml import etree as ET
tree = ET.parse("file.html")

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src/lxml/lxml.etree.c:72517)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105979)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106278)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105277)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100227)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94853)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 14, column 159

1 Answer:

Answer 0 (score: 13)

Use lxml.html, not lxml.etree, for HTML. &nbsp; is legitimately not predefined in XML, but it is available in HTML. Thus:

>>> import lxml.html
>>> lxml.html.fromstring('''<tr><td>&nbsp;</td><tr>''')
<Element div at 0x10a7a5e68>

...works fine.
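
Since the question parses a file rather than a string, the same idea can be sketched with lxml.html.parse, which accepts a filename or a file-like object. This is only a minimal sketch: the sample markup is made up, and the in-memory buffer stands in for the asker's file.html.

```python
import io

import lxml.html

# Hypothetical stand-in for the contents of file.html.
html_bytes = b"<table><tr><td>&nbsp;</td></tr></table>"

# lxml.html.parse takes a filename or file-like object and returns an
# ElementTree built with the HTML parser, which knows the nbsp entity.
tree = lxml.html.parse(io.BytesIO(html_bytes))
cell = tree.getroot().find(".//td")

# The entity is decoded to the no-break space character U+00A0.
print(repr(cell.text))
```

With a real file, passing the path directly, as in lxml.html.parse("file.html"), should behave the same way.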

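Alternatively, you can use the XML equivalent of &nbsp; in your document, namely &#160;, or you can declare a DOCTYPE in your XML file and include <!ENTITY nbsp "&#160;"> in its contents.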
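
Both XML-side alternatives can be tried with lxml.etree directly. The snippet below is a sketch under assumptions: the table wrapper element and the DOCTYPE name are illustrative choices, not something taken from the asker's file.

```python
from lxml import etree

# Option 1: a numeric character reference needs no declaration in XML.
row = etree.fromstring("<tr><td>&#160;</td></tr>")
print(repr(row.find("td").text))

# Option 2: declare nbsp in an internal DTD subset so that &nbsp; resolves.
# The <table> root and the DOCTYPE name are assumptions for this example.
doc = b"""<!DOCTYPE table [
  <!ENTITY nbsp "&#160;">
]>
<table><tr><td>&nbsp;</td></tr></table>"""
table = etree.fromstring(doc)
print(repr(table.find(".//td").text))
```

Either way the XML parser ends up with a no-break space in the cell text, so the rest of the code can keep using lxml.etree.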