Python Crawler - html.fromstring

Asked: 2016-12-28 12:50:14

Tags: python web-crawler

I am trying to parse a web page with this code.

import requests
from lxml import html

ac = requests.get('link....')
html_text = ac.text
lx = html.fromstring(html_text)

When I run this code, I get the following error:

Traceback (most recent call last):
  File "Crawler.py", line 197, in <module>
    cnx.close()
  File "Crawler.py", line 46, in RequestPage
    lx = html.fromstring(html_text)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 867, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:76696)
  File "src\lxml\parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:115101)
  File "src\lxml\parser.pxi", line 1711, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:113677)
  File "src\lxml\parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:107847)
  File "src\lxml\parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:102150)
  File "src\lxml\parser.pxi", line 694, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:103800)
  File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:102888)
lxml.etree.XMLSyntaxError: line 1843: Tag ie:menuitem invalid

I found the HTML tag that causes the error:

<ie:menuitem id="MSOMenu_Help" iconsrc="/_layouts/images/HelpIcon.gif" onmenuclick="MSOWebPartPage_SetNewWindowLocation(MenuWebPart.getAttribute('helpLink'), MenuWebPart.getAttribute('helpMode'))" text="Help" type="option" style="display:none">

</ie:menuitem>

1 answer:

Answer 0 (score: 0)

You found an HTML tag that triggers the error, but are you sure it is the cause? If not, try this:

ac = requests.get('link....')
lx = html.fromstring(ac.content)
valueOfHTMLTag = lx.xpath('//TAG[@class/id="Name"]/text()')

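What you need to change:

  • TAG: the tag whose value you want to get.
  • class/id: pick either the tag's class or its id.
  • Name: the id or class name to match.

This returns a list containing the text of every tag with the matching class/id.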
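To make the placeholders concrete, here is a minimal sketch with them filled in. The sample markup, the `price` and `note` names, and the `text_by_attr` helper are all hypothetical and only there for illustration; in the real crawler the bytes would come from `requests.get(...).content`.

```python
from lxml import html

# Hypothetical sample page standing in for the downloaded bytes.
SAMPLE = b"""
<html><body>
  <div id="price">42</div>
  <span class="note">first</span>
  <span class="note">second</span>
</body></html>
"""


def text_by_attr(doc, tag, attr, name):
    """Return the text nodes of every <tag> whose attribute `attr` equals `name`.

    Hypothetical helper: it builds the same XPath as the template above.
    Only pass trusted literals, since the values are interpolated directly.
    """
    return doc.xpath('//%s[@%s="%s"]/text()' % (tag, attr, name))


lx = html.fromstring(SAMPLE)

# Select by id: TAG -> div, id -> "price"
by_id = text_by_attr(lx, "div", "id", "price")

# Select by class: TAG -> span, class -> "note"
by_class = text_by_attr(lx, "span", "class", "note")
```

Since `xpath()` always returns a list here, an empty list means nothing matched, so check its length before indexing into it.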

Hope this helps!