Question

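I recently switched from BeautifulSoup to lxml because lxml can handle broken HTML, which is what I am dealing with. I would like to know the equivalent of BeautifulSoup's find(), or a programmatic way to get the same result. In BS I can find a tree node with a search like this: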

from bs4 import BeautifulSoup

bs = BeautifulSoup(html)
bs.find('span', {'class': 'some-class-name'})

lxml's find() only searches the current level of the tree. What should I do if I want to search all nodes in the tree?

Thanks

Answer 1

You can use cssselect:

import lxml.html

root = lxml.html.fromstring(html)
root.cssselect('span.some-class-name')  # needs the separate cssselect package

Or XPath:

root.xpath('.//span[@class="some-class-name"]')

Both the cssselect and xpath methods return a list of matching elements, like BeautifulSoup's findAll/find_all methods.
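
Since both methods return a list, something closer to find() (first match or None) can be sketched as below. The helper name find_first and the sample HTML are made up for illustration and are not part of lxml. Note also that @class="some-class-name" only matches when the attribute is exactly that string, so this sketch uses the common contains/concat XPath idiom to match one class among several, which is an assumption about what you want rather than something the answer above states.

```python
import lxml.html


def find_first(root, tag, class_name):
    """Hypothetical helper: first descendant <tag> whose class list
    contains class_name, or None when nothing matches."""
    # Pad the class attribute with spaces so " name " matches whole tokens only.
    expr = (
        './/%s[contains(concat(" ", normalize-space(@class), " "), $token)]' % tag
    )
    # lxml passes keyword arguments to xpath() as XPath variables ($token).
    matches = root.xpath(expr, token=" %s " % class_name)
    return matches[0] if matches else None


# Sample markup invented for this example.
html = '<div><p>intro</p><div><span class="some-class-name other">hit</span></div></div>'
root = lxml.html.fromstring(html)

node = find_first(root, "span", "some-class-name")
missing = find_first(root, "span", "no-such-class")
```

Here node is the nested span element and missing is None, which roughly mirrors what bs.find() gives you when nothing matches.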

Answer 2

If you don't want to learn the lxml API or XPath expressions, here is another option:

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser [...]

To specify a particular parser to use:

BeautifulSoup(markup, "lxml")
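
As a rough sketch of how that fits the original question, assuming both the bs4 and lxml packages are installed: the familiar find() call stays the same and only the parser argument changes. The broken_html string is made up for the example.

```python
from bs4 import BeautifulSoup

# Invented sample: the <span> is never closed.
broken_html = '<div><span class="some-class-name">hit</div>'

# "lxml" tells Beautiful Soup to build the tree with lxml's HTML parser.
soup = BeautifulSoup(broken_html, "lxml")

# Same search as in the question; find() looks through the whole tree.
node = soup.find('span', {'class': 'some-class-name'})
text = node.get_text() if node is not None else None
```

This keeps the BeautifulSoup search API while letting lxml do the parsing of the malformed markup.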