Question

Until now, the pages I scrape have always had this structure:

<h2 class="dot">headline 1</h2>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>

But some of the sites I scrape may have the following structure instead:

<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>

I scrape it like this:

for product in soup.findAll("p"):

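I don't think this gives me any way to tell whether different p elements belong together. Does anyone know how to determine whether one or two p elements belong to the same logical unit?

One possible approach would be to check whether the previous HTML element is a p or an h2. Is there a good way to find that out?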

Answer 1

Here you go:

from bs4 import BeautifulSoup

html="""
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>
</div>
"""

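# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")

for h2 in soup.find_all("h2"):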
    group = []
    node = h2.next_sibling

    while node is not None and node.name != "h2":
        group.append(node)
        node = node.next_sibling

    # Do whatever you want with the group
    print(group)

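What this does is loop over all the h2 elements, walk through each one's next siblings, and append them to a list until the siblings run out or another h2 is reached. If you only want the <p> elements, change: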

group.append(node)

to:

if node.name == "p":
    group.append(node)
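
One detail worth noting: next_sibling also returns the whitespace text nodes that sit between the tags, and their name is None, so the node.name == "p" check is what filters them out. Below is a minimal sketch of the filtered walk wrapped in a function. The name paragraphs_by_heading is made up for illustration, and the "html.parser" backend is an assumption rather than a requirement.

```python
from bs4 import BeautifulSoup


def paragraphs_by_heading(html):
    """Return a list of (heading text, [paragraph texts]) pairs.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for h2 in soup.find_all("h2"):
        paragraphs = []
        node = h2.next_sibling
        # Walk forward until the siblings run out or the next h2 starts.
        while node is not None and node.name != "h2":
            # Text nodes (such as newlines) have name None and are skipped here.
            if node.name == "p":
                paragraphs.append(node.get_text())
            node = node.next_sibling
        result.append((h2.get_text(), paragraphs))
    return result


sample = """
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>
</div>
"""

print(paragraphs_by_heading(sample))
```

Returning pairs rather than a dict keeps the document order and still works when two headings share the same text.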

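Oh, and one last comment: unless you really need a list, it is better to do whatever you want with each element directly inside the loop rather than appending it to a list first, like this: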

from bs4 import BeautifulSoup

html="""
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>
</div>
"""

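# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")

for h2 in soup.find_all("h2"):
    node = h2.next_sibling

    print("This h2", h2)

    # Walk the siblings until they run out or the next h2 is reached.
    while node is not None and node.name != "h2":
        if node.name == "p":
            print(node)
        node = node.next_sibling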

Output:

This h2 <h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
This h2 <h2 class="dot">headline 2</h2>
<p>text</p>
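
The idea raised in the question, looking backwards from each p to see what precedes it, can also be sketched with find_previous_sibling. This is only an alternative under the assumption that every p and its h2 share the same parent, as in the sample markup; the groups dict and its keys are illustrative choices, not something from the original code.

```python
from bs4 import BeautifulSoup

html = """
<div>
<h2 class="dot">headline 1</h2>
<p>text</p>
<p>text</p>
<h2 class="dot">headline 2</h2>
<p>text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

groups = {}
for p in soup.find_all("p"):
    # Nearest preceding sibling that is an h2, or None if the p comes before any h2.
    heading = p.find_previous_sibling("h2")
    key = heading.get_text() if heading is not None else None
    groups.setdefault(key, []).append(p.get_text())

print(groups)
```

Keying by heading text merges sections whose headings are identical, so the forward walk from each h2 is the safer choice when headings can repeat.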

How to determine logical sections with Python's Beautiful Soup when crawling
