Question

Is there a way to turn find_all into a more memory-efficient generator? For example:

Given:

soup = BeautifulSoup(content, "html.parser")
return soup.find_all('item')

I would like to use this instead:

soup = BeautifulSoup(content, "html.parser")
while True:
    yield soup.next_item_generator()

(assuming the final StopIteration exception is handled properly)

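There are some built-in generators, but none that yields the next result of a search. find returns only the first item. With thousands of items, find_all consumes a lot of memory. For 5792 items, I am seeing memory usage of just over 1GB.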

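I am well aware that there are more efficient parsers, such as lxml, that can accomplish this. Let's assume that other business constraints prevent me from using anything else.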

How can I turn find_all into a generator so that I can iterate in a more memory-efficient way?

Answer 1

There is no "find" generator in BeautifulSoup that I am aware of, but we can combine SoupStrainer with the .children generator.

Let's say we have this sample HTML:

<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
</div>

We need to get the text of all the item nodes from it.

We can use SoupStrainer to parse only the item tags, then iterate over the .children generator and get the text:

from bs4 import BeautifulSoup, SoupStrainer

data = """
<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
</div>"""

parse_only = SoupStrainer('item')
soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)
for item in soup.children:
    print(item.get_text())

Prints:

Item 1
Item 2
Item 3
Item 4
Item 5

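In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, such as .children. You can also use one of these generators directly and filter the tags manually, by name or by other criteria, inside the generator body. For example, something like: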

def generate_items(soup):
    for tag in soup.descendants:
        if tag.name == "item":
            yield tag.get_text()

.descendants yields child elements recursively, whereas .children considers only the direct children of a node.
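The two ideas above can be combined into a single generator function. This is only a minimal sketch: the name iter_tag_text and its parameters are made up for illustration and are not part of BeautifulSoup. It also filters on Tag explicitly, because .descendants yields text nodes as well as tags.

```python
from bs4 import BeautifulSoup, SoupStrainer, Tag


def iter_tag_text(markup, tag_name):
    """Yield the text of each `tag_name` element, one at a time (hypothetical helper)."""
    # Parse only the tags we care about; everything else is dropped at parse time.
    strainer = SoupStrainer(tag_name)
    soup = BeautifulSoup(markup, "html.parser", parse_only=strainer)
    for node in soup.descendants:
        # .descendants also yields text nodes, so keep matching Tag objects only.
        if isinstance(node, Tag) and node.name == tag_name:
            yield node.get_text()


data = """
<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
</div>"""

texts = list(iter_tag_text(data, "item"))
print(texts)  # ['Item 1', 'Item 2', 'Item 3']
```

Note that the strained tree is still built in full before iteration starts; the generator only avoids creating a second list of results.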

Answer 2

The simplest way is to use find_next:

soup = BeautifulSoup(content, "html.parser")

def find_iter(tagname):
    tag = soup.find(tagname)
    while tag is not None:
        yield tag
        tag = tag.find_next(tagname)
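As a rough usage sketch, here is the same generator with the soup passed in as a parameter (a small change made here so that the example is self-contained) and a made-up sample document. It shows that the caller can stop early without visiting the remaining matches.

```python
from itertools import islice

from bs4 import BeautifulSoup


def find_iter(soup, tagname):
    """Lazily yield each `tagname` element in document order."""
    tag = soup.find(tagname)
    while tag is not None:
        yield tag
        # find_next searches forward from the current tag for the next match.
        tag = tag.find_next(tagname)


# Sample data, assumed for illustration only.
data = "<div><item>Item 1</item><item>Item 2</item><item>Item 3</item></div>"
soup = BeautifulSoup(data, "html.parser")

# Take only the first two matches; the third is never looked up.
first_two = [tag.get_text() for tag in islice(find_iter(soup, "item"), 2)]
print(first_two)  # ['Item 1', 'Item 2']
```

This avoids building the full result list that find_all returns, but the parsed tree itself is still held in memory.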

Answer 3

From the documentation:

I gave the generators PEP 8-compliant names, and transformed them into properties:

childGenerator() -> children
nextGenerator() -> next_elements
nextSiblingGenerator() -> next_siblings
previousGenerator() -> previous_elements
previousSiblingGenerator() -> previous_siblings
recursiveChildGenerator() -> descendants
parentGenerator() -> parents

There is a chapter in the documentation named Generators; you can read it.
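A short sketch of two of the renamed properties in use, on a made-up sample document. Both are lazy iterators, and both can yield text nodes as well as tags, hence the isinstance check.

```python
from bs4 import BeautifulSoup, Tag

# Sample data, assumed for illustration only.
data = "<div><item>Item 1</item><item>Item 2</item><item>Item 3</item></div>"
soup = BeautifulSoup(data, "html.parser")
first = soup.find("item")

# next_siblings lazily yields the nodes that follow `first` under the same parent.
later_items = [sib.get_text() for sib in first.next_siblings if isinstance(sib, Tag)]
print(later_items)  # ['Item 2', 'Item 3']

# parents walks upward from the element toward the root of the tree.
ancestors = [parent.name for parent in first.parents]
print(ancestors[0])  # div
```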

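SoupStrainer parses only part of the HTML, which can save memory, but it only excludes irrelevant tags. If your HTML contains many of the tags you want, you will run into the same memory problem.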
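One rough way to see this is to count the tags that end up in the tree, using that count as a stand-in for memory use. This is only a sketch: count_tags is a hypothetical helper and both sample documents are made up.

```python
from bs4 import BeautifulSoup, SoupStrainer


def count_tags(markup, parse_only=None):
    """Return how many tags the parsed tree holds (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser", parse_only=parse_only)
    return len(soup.find_all(True))


only_items = SoupStrainer("item")

# A document where the wanted tags are mixed with unrelated markup.
mixed = "<div><p>noise</p><item>a</item><p>noise</p><item>b</item></div>"
# A document that consists of nothing but the wanted tags.
items_only = "<item>a</item><item>b</item>"

print(count_tags(mixed), count_tags(mixed, only_items))            # 5 2
print(count_tags(items_only), count_tags(items_only, only_items))  # 2 2
```

In the mixed document the strainer shrinks the tree; in the items-only document it removes nothing, so the tree is just as large as before.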
