Question

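I am trying to parse an HTML table into Python (2.7) using the solutions in this post. When I try either of the first two solutions on a string (as in the example), it works perfectly. But when I try to use etree.XML on an HTML page that I read with urllib, I get an error. For each solution, I checked that the variable I pass is also a str. For the following code: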

from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)

I get this error:

File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
    table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and line 8, column 48

And for this code:

from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)

I get this error:

Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
    table = ET.XML(s)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111

Answer 1

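Although they may look like the same kind of markup, HTML is not as strict as XML about being well-formed and following markup rules (opening/closing nodes, escaping entities, etc.). As a result, markup that is valid HTML may be rejected by an XML parser.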
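To make the difference concrete, here is a minimal sketch that uses only the standard library (Python 3). The `snippet` string, the `parses_as_xml` helper, and the `TagCollector` class are made up for illustration; they are not taken from the page in the question. The unclosed `<link>` tag is ordinary HTML, but a strict XML parser sees `</head>` arrive while `<link>` is still open and reports a mismatched tag, which is the same kind of error as in the tracebacks above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: <link> is never closed, which is fine in HTML
# but makes the document ill-formed as XML.
snippet = (
    '<html><head><link rel="stylesheet" href="style.css"></head>'
    '<body><p>ok</p></body></html>'
)


def parses_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.XML(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(snippet)
collector.close()

print(parses_as_xml(snippet))  # the XML parser rejects the snippet
print(collector.tags)          # the HTML parser walks through every tag
```

The strict parser gives up at the first structural problem, while the HTML parser keeps going and still sees every tag, including the ones after the unclosed `<link>`.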

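Hence, consider using etree's HTML() function to parse the page. In addition, you can use XPath to target the specific areas you want to extract or work with. Below is an example that attempts to pull the main table from the page. Note that the web page uses quite a few nested tables.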

from lxml import etree
import urllib.request as rq  # Python 3; on Python 2.7 use "import urllib as rq"
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))

# PARSE PAGE
htmlpage = etree.HTML(s)

# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")

for row in htmltable:
    print(row)
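
The XPath above returns one flat list of text nodes. If you would rather have one list per table row, something along the following lines may work. This is only a sketch: the `sample` markup is invented to mimic a nested-table layout with an unclosed `<br>`, and the real page's structure may differ (for example, the header cell there is wrapped in `font/a/b` rather than just `b`).

```python
from lxml import etree

# Hypothetical markup standing in for the downloaded page: an outer layout
# table wrapping the data table, plus an unclosed <br> that XML would reject.
sample = """
<html><body>
<table><tr><td>
  <table>
    <tr><td><b>Rank</b></td><td><b>Title</b></td></tr>
    <tr><td>1</td><td>Film A</td></tr>
    <tr><td>2</td><td>Film B<br></td></tr>
  </table>
</td></tr></table>
</body></html>
"""

# The HTML parser tolerates the unclosed tag.
page = etree.HTML(sample)

# Select the rows of the table whose own rows contain a bold 'Rank' cell.
# The outer layout table does not match because its <b> is nested deeper.
rows = page.xpath("//table[tr/td/b='Rank']/tr")

# Collapse each row into a list of stripped cell strings.
data = [[td.xpath("string()").strip() for td in row.xpath("td")] for row in rows]

for record in data:
    print(record)
```

Working row by row keeps the rank and title of each entry together, which is usually easier to load into a CSV file or a list of records than a flat sequence of text nodes.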
