Question

I'm using Django and Python 3.7. I want to speed up my HTML parsing. Currently, I'm looking for three types of elements in the document:

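import re
import urllib.request

from bs4 import BeautifulSoup

# urllib2 does not exist in Python 3; its functionality lives in urllib.request.
req = urllib.request.Request(fullurl, headers=settings.HDR)
html = urllib.request.urlopen(req).read()
comments_soup = BeautifulSoup(html, features="html.parser")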

score_elts = comments_soup.findAll("div", {"class": "score"})

comments_elts = comments_soup.findAll("a", attrs={'class': 'comments'})

bad_elts = comments_soup.findAll("span", text=re.compile("low score"))

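I've read that SoupStrainer is one way to improve performance: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document. However, all the examples only discuss parsing an HTML document with a single filter. In my case, I have three. How can I pass three filters into my parse, or would doing so actually perform worse than what I'm doing now?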

Answer 1

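I don't think you can pass multiple filters to the BeautifulSoup constructor. What you can do instead is wrap all of your conditions into a single strainer and pass that to the BeautifulSoup constructor.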

For simple cases, such as filtering on tag names only, you can pass a list to SoupStrainer:

html="""
<a>yes</a>
<p>yes</p>
<span>no</span>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
custom_strainer = SoupStrainer(["a", "p"])
soup = BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)

Output

<a>yes</a><p>yes</p>

To express more complex logic, you can also pass in a custom function (which is probably what you will have to do).

html="""
<html class="test">
<a class="wanted">yes</a>
<a class="not-wanted">no</a>
<p>yes</p>
<span>no</span>
</html>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
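def my_function(elem, attrs):
    # elem is the tag name, attrs is a dict of the tag's attributes.
    # attrs.get avoids a KeyError for <a> tags that have no class attribute.
    if elem == 'a' and attrs.get('class') == "wanted":
        return True
    elif elem == 'p':
        return True
    return False

custom_strainer = SoupStrainer(my_function)
soup = BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)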

Output

<a class="wanted">yes</a><p>yes</p>

As the documentation states:

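Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.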
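To get a feel for why searching gets cheaper, here is a minimal sketch that counts how many tags end up in the tree with and without a strainer. The generated HTML and the variable names are made up for illustration, and it uses the built-in "html.parser" so it runs without lxml installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document: 100 filler paragraphs interleaved with 100 score divs.
html = (
    "<html><body>"
    + "".join(f'<p>filler {i}</p><div class="score">{i}</div>' for i in range(100))
    + "</body></html>"
)

# Full parse: every tag in the document becomes part of the tree.
full = BeautifulSoup(html, "html.parser")

# Strained parse: only <div> tags (and whatever they contain) are kept.
partial = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("div"))

# find_all(True) matches every tag, so this is the size of each tree in tags.
full_count = len(full.find_all(True))
partial_count = len(partial.find_all(True))

print(full_count, partial_count)
```

Both trees still return the same score elements from find_all("div", class_="score"); the strained one simply has fewer nodes to walk.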

I think you should take a look at the "Improving performance" section of the documentation.
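For the three searches in the question, one possible approach is sketched below. It assumes a single strainer listing the three tag names is enough to shrink the tree, and that the class and text conditions are then applied with the usual find_all calls on the reduced soup. The text condition ("low score") in particular is left to find_all here, since this strainer only filters by tag name. SAMPLE_HTML is invented for illustration; "html.parser" matches the parser used in the question.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page containing the three kinds of elements from the question,
# plus some markup that none of the searches need.
SAMPLE_HTML = """
<html><body>
<h1>Title</h1>
<div class="score">42</div>
<div class="other">ignored by find_all</div>
<p><a class="comments" href="/c/1">3 comments</a></p>
<span>low score warning</span>
<table><tr><td>not needed</td></tr></table>
</body></html>
"""

# One strainer covering all three tag names of interest.
only_candidates = SoupStrainer(["div", "a", "span"])
soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_candidates)

# The original three searches, now run against the smaller tree.
score_elts = soup.find_all("div", class_="score")
comments_elts = soup.find_all("a", class_="comments")
bad_elts = soup.find_all("span", string=re.compile("low score"))

print(score_elts, comments_elts, bad_elts)
```

Whether this is actually faster than the three searches on a full parse depends on the page, so it is worth timing both variants on real data before committing to one.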

Can multiple filters be used on a single BeautifulSoup document?
