Question

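I use lxml.html to parse various HTML pages. I have now noticed that, on at least some pages, it fails to find the body tag even though the tag is present, whereas Beautiful Soup finds it (even when it uses lxml as its parser).

Example page: https://plus.google.com/ (what is left of it)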

import lxml.html
import bs4

html_string = """
    ... source code of https://plus.google.com/ (manually copied) ...
"""

# lxml fails (body is None)
body = lxml.html.fromstring(html_string).find('body')

# Beautiful soup using lxml parser succeeds
body = bs4.BeautifulSoup(html_string, 'lxml').find('body')

Any guesses about what is going on here are welcome :)

Update:

The problem seems to be related to encoding.

# working version
body = lxml.html.document_fromstring(html_string.encode('unicode-escape')).find('body')
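For reference, here is a small sketch of what the 'unicode-escape' codec does to a string, using only the standard library. The sample text and the helper is_ascii_only are made up for illustration; they are not taken from the page in question. The point is that the parser ends up receiving pure ASCII bytes, at the cost of rewriting non-ASCII characters and newlines as literal escape sequences, so the parsed text no longer matches the original page text. Encoding to UTF-8 keeps the text intact, though whether that alone avoids the problem described above is not established here.

```python
# Hypothetical sample text standing in for the real page source.
sample = "café\n日本"

# 'unicode-escape' rewrites non-ASCII characters and newlines as
# backslash escape sequences, so the result is plain ASCII bytes.
escaped = sample.encode("unicode-escape")

# UTF-8 keeps the original characters, encoded as multi-byte sequences.
utf8 = sample.encode("utf-8")


def is_ascii_only(data: bytes) -> bool:
    """Illustrative helper: True if every byte is in the ASCII range."""
    return all(byte < 128 for byte in data)


print(escaped)                 # escape sequences instead of é, newline, 日本
print(is_ascii_only(escaped))  # True
print(is_ascii_only(utf8))     # False
```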

Answer 1

You can use the following:

import requests
import lxml.html

html_string = requests.get("https://plus.google.com/").content
body = lxml.html.document_fromstring(html_string).find('body')

The body variable then contains the body HTML element.
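As a hedged side note, the sketch below shows several ways to reach the body element and one situation in which find('body') returns None. The markup is a made-up stand-in, not the Google+ source, and nothing here confirms that this is what went wrong in the question. It relies on two things: find() with a bare tag name only looks at direct children, and lxml.html.fromstring() may return a fragment element rather than the html root when the input does not look like a full document.

```python
import lxml.html

# Hypothetical stand-in for a full page, passed as bytes.
page = (b"<!DOCTYPE html><html><head><title>t</title></head>"
        b"<body><p>hi</p></body></html>")

# document_fromstring always returns the <html> root element.
doc = lxml.html.document_fromstring(page)

body_child = doc.find("body")     # searches direct children of <html> only
body_any = doc.find(".//body")    # searches at any depth
body_prop = doc.body              # convenience property on lxml.html elements

# fromstring on a fragment returns the fragment's element, not <html>,
# so looking for a <body> child finds nothing.
frag = lxml.html.fromstring("<p>hi</p>")
frag_body = frag.find("body")

print(doc.tag, body_child.tag, frag.tag, frag_body)
```

If the root returned by the parser is not known in advance, document_fromstring together with the body property or './/body' is the less surprising combination.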
