Question

I am trying to extract all the <p> tags that appear after the <i> tag, up until an <h1> is reached, and then repeat for the next section.

Sample HTML:

<h1></h1>
<p></p>
<i></i>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<h1></h1>
<p></p>

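The problem is that the number of "Need to grab this" <p> tags varies: sometimes there is one, sometimes three, or even eight. How can I write a loop that captures all of them until the next <h1> tag?

At the moment I am using BeautifulSoup.

This is my current Python code: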

headernum = 0
i = 0

x = soup.find_all("h1")

for i in range(len(x)):
    header = soup.find_all('h1')[headernum]
    name = header.find_all_next('p')[1]
    print(name.text)
    workplace = name.find_all_next('i')[0]
    print(workplace.text)
    abstract = workplace.find_all_next('p')[1].get_text()
    print(abstract)
    i += 1
    headernum += 1

Answer 1

You can loop over the element.next_siblings iterator: given a starting element, iterate over the siblings that follow it until you reach your end condition:

for elem in start.next_siblings:
    if elem.name == 'h1':
        break
    if elem.name != 'p':
        continue
    # it's a <p> tag before the next <h1>
    ...

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <h1></h1>
... <p></p>
... <i></i>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <h1></h1>
... <p></p>
... ''', 'html.parser')
>>> start = soup.find('i')
>>> for elem in start.next_siblings:
...     if elem.name == 'h1':
...         break
...     if elem.name != 'p':
...         continue
...     print(elem)
...
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>

Combined with your existing code (cleaned up a little):

for header in soup.find_all("h1"):
    name = header.find_next_siblings('p', limit=2)[-1]
    print(name.text)
    workplace = name.find_next_sibling('i')
    print(workplace.text)

    abstract = []
    for elem in name.next_siblings:
        if elem.name == 'h1':
            break
        if elem.name != 'p':
            continue
        # it's a <p> tag before the next <h1>
        abstract.append(elem.get_text())

    print('\n'.join(abstract))
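
If you need this in more than one place, the sibling loop can be wrapped in a small generator. The following is only a sketch: the helper name paragraphs_after and the SAMPLE_HTML markup are made up for illustration, and it assumes that every <h1>, <i> and <p> shares the same parent, as in the sample from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the sample in the question.
SAMPLE_HTML = """
<h1>First</h1>
<p>intro</p>
<i>workplace A</i>
<p>Need to grab this 1</p>
<p>Need to grab this 2</p>
<h1>Second</h1>
<p>intro</p>
<i>workplace B</i>
<p>Need to grab this 3</p>
"""


def paragraphs_after(start):
    """Yield the text of each <p> sibling after `start`, stopping at the next <h1>."""
    for elem in start.next_siblings:
        # Text nodes between tags have no tag name, so fall back to None.
        name = getattr(elem, "name", None)
        if name == "h1":
            break
        if name == "p":
            yield elem.get_text()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One list of paragraph texts per <i> tag, i.e. per section.
sections = [list(paragraphs_after(i_tag)) for i_tag in soup.find_all("i")]
print(sections)
# [['Need to grab this 1', 'Need to grab this 2'], ['Need to grab this 3']]
```

Starting from the <i> tag rather than from the <h1> skips the paragraphs that come before it in each section, so only the variable-length run of <p> tags is collected.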

Answer 2

With XPath this can be solved like this:

//h1/following-sibling::p

This should give you every <p> that is a following sibling of an <h1>.
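
Note that this expression on its own does not stop at the next <h1>: it returns every <p> that follows any <h1> sibling. A rough sketch of one way to group the results per header is shown here, assuming lxml is the XPath engine (the answer does not name one) and that all <h1> and <p> elements share the same parent. The SAMPLE_HTML markup and the paragraphs_per_header helper are hypothetical.

```python
from lxml import html

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<html><body>
<h1>First</h1>
<p>intro A</p>
<p>Need to grab this 1</p>
<h1>Second</h1>
<p>intro B</p>
</body></html>
"""

doc = html.fromstring(SAMPLE_HTML)

# The plain expression: every <p> following any <h1>, in document order.
all_p = [p.text_content() for p in doc.xpath("//h1/following-sibling::p")]


def paragraphs_per_header(doc):
    """Return one list of <p> texts per <h1>, stopping at the next <h1>."""
    groups = []
    for h1 in doc.xpath("//h1"):
        # 1-based position of this <h1> among its <h1> siblings.
        k = len(h1.xpath("preceding-sibling::h1")) + 1
        # Keep only the <p> siblings that have exactly k <h1> siblings before them,
        # which means they sit between this <h1> and the next one.
        ps = h1.xpath(
            "following-sibling::p[count(preceding-sibling::h1) = $k]", k=k
        )
        groups.append([p.text_content() for p in ps])
    return groups


groups = paragraphs_per_header(doc)
print(all_p)   # ['intro A', 'Need to grab this 1', 'intro B']
print(groups)  # [['intro A', 'Need to grab this 1'], ['intro B']]
```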

Extract all <p> tags until an <h1> appears? BeautifulSoup and Python
