Non-recursive find_all with the lxml builder

Asked: 2016-03-25 18:17:51

Tags: python python-2.7 parsing beautifulsoup lxml

In Python 2.7, I've found that I can't perform a non-recursive bs4.BeautifulSoup.find_all when I use the lxml builder.

Take the following example HTML snippet:

<p> <b> Cats </b> are interesting creatures </p>

<p> <b> Dogs </b> are cool too </p>

<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>

<p> <b> Llamas </b> don't live in New York </p>

Say I want to find all p elements that are direct children. I do that with a non-recursive find_all: find_all("p", recursive=False).

To test this, I assigned the HTML snippet above to a variable named html. Then I created two BeautifulSoup instances, a and b:

a = bs4.BeautifulSoup(html, "html.parser")
b = bs4.BeautifulSoup(html, "lxml")

With a normal (recursive) find_all, both behave correctly:

>>> a.find_all("p")
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]
>>> b.find_all("p")
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]

But if I turn off recursion, only a works; b returns an empty list:

>>> a.find_all("p", recursive=False)
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>]
>>> b.find_all("p", recursive=False)
[]

Why is this? Is it a bug, or am I doing something wrong? Does the lxml builder support non-recursive find_all?

1 answer:

Answer 0 (score: 1)

This is because the lxml parser wraps your HTML in html/body elements if they are not already present:

>>> b = bs4.BeautifulSoup(html, "lxml")
>>> print(b)
<html><body><p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>
</body></html>
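The difference in tree shape can be checked directly by listing the top-level tags each parser produces. This is only a minimal sketch: the helper name top_level_names is made up for illustration, and the lxml call is guarded in case that parser is not installed.

```python
import bs4

html = """<p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>"""


def top_level_names(parser):
    # Hypothetical helper: names of the tags that are direct children
    # of the BeautifulSoup object built with the given parser.
    soup = bs4.BeautifulSoup(html, parser)
    return [tag.name for tag in soup.find_all(True, recursive=False)]


html_parser_names = top_level_names("html.parser")
try:
    lxml_names = top_level_names("lxml")
except bs4.FeatureNotFound:
    # lxml is an optional dependency; skip it if it is missing.
    lxml_names = None

print(html_parser_names)
print(lxml_names)
```

With html.parser the p and div tags sit directly under the soup object, while with lxml the only top-level tag should be html.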

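So find_all() in non-recursive mode only looks at the direct children of the soup object. The only one is the html element, which in turn has only body as a child, so no p element is found at that level. Calling it on body instead works: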

>>> print(b.find_all("p", recursive=False))
[]
>>> print(b.body.find_all("p", recursive=False))
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>]
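If the same code has to work with either parser, one option is to pick the search root depending on whether a body element exists. The following is a rough sketch rather than an official bs4 recipe: direct_children and animals are hypothetical helper names, and the lxml call is guarded in case lxml is not installed.

```python
import bs4

html = """<p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>"""


def direct_children(soup, name):
    # Hypothetical helper: search under <body> when the parser added one
    # (lxml), otherwise under the soup object itself (html.parser here).
    root = soup.body if soup.body is not None else soup
    return root.find_all(name, recursive=False)


def animals(parser):
    # Collect the bold text of every top-level <p> for the given parser.
    soup = bs4.BeautifulSoup(html, parser)
    return [p.b.get_text(strip=True) for p in direct_children(soup, "p")]


html_parser_result = animals("html.parser")
try:
    lxml_result = animals("lxml")
except bs4.FeatureNotFound:
    # lxml is an optional dependency; skip it if it is missing.
    lxml_result = None

print(html_parser_result)
print(lxml_result)
```

Both parsers should then skip the Penguins paragraph, since it is nested inside the div rather than being a direct child of the chosen root.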