Extract all &lt;p&gt; tags until an &lt;h1&gt; appears? BeautifulSoup and Python

Date: 2018-10-02 15:37:17

Tags: python html beautifulsoup

I am trying to extract all of the &lt;p&gt; tags that appear after the &lt;i&gt; tag, until an &lt;h1&gt; is hit, and then repeat.

Example HTML:

<h1><h1>
<p></p>
<i></i>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<h1><h1>
<p></p>

The problem is that the number of "Need to grab this" &lt;p&gt; tags varies, so sometimes there may be one, three, or even eight of them. How can I create a loop that captures all of them until the next &lt;h1&gt; tag?

I am using BeautifulSoup at the moment.

Here is my current Python code:

headernum = 0
i = 0

x = soup.find_all("h1")

for i in range(len(x)):
    header = soup.find_all('h1')[headernum]
    name = header.find_all_next('p')[1]
    print(name.text)
    workplace = name.find_all_next('i')[0]
    print(workplace.text)
    abstract = workplace.find_all_next('p')[1].get_text()
    print(abstract)
    i += 1
    headernum += 1

2 Answers:

Answer 0 (score: 1)

You can loop over the element.next_siblings iterator; given a starting element, iterate over the following siblings until you reach your end condition:

for elem in start.next_siblings:
    if elem.name == 'h1':
        break
    if elem.name != 'p':
        continue
    # it's a <p> tag before the next <h1>
    ... 

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <h1><h1>
... <p></p>
... <i></i>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <h1><h1>
... <p></p>
... ''')
>>> start = soup.find('i')
>>> for elem in start.next_siblings:
...     if elem.name == 'h1':
...         break
...     if elem.name != 'p':
...         continue
...     print(elem)
...
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>

Combined with your existing code (slightly polished):

for header in soup.find_all("h1"):
    name = header.find_next_siblings('p', limit=2)[-1]
    print(name.text)
    workplace = name.find_next_sibling('i')
    print(workplace.text)

    abstract = []
    for elem in name.next_siblings:
        if elem.name == 'h1':
            break
        if elem.name != 'p':
            continue
        # it's a <p> tag before the next <h1>
        abstract.append(elem.get_text())

    print('\n'.join(abstract))
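Putting the whole approach together, here is a runnable sketch of the sibling-walking technique on well-formed sample markup. The helper name `paragraphs_per_header` and the section/paragraph texts are my own illustration, not from the question:

```python
from bs4 import BeautifulSoup

# Well-formed stand-in for the question's markup (the question's sample
# has unclosed <h1><h1> tags, which parsers may nest unpredictably).
html = """
<h1>Section A</h1>
<p>Alice</p>
<i>Acme</i>
<p>first</p>
<p>second</p>
<h1>Section B</h1>
<p>Bob</p>
<i>Beta</i>
<p>only</p>
"""

def paragraphs_per_header(soup):
    """Return {header text: [texts of <p> siblings before the next <h1>]}."""
    sections = {}
    for header in soup.find_all('h1'):
        texts = []
        for elem in header.next_siblings:
            if elem.name == 'h1':
                break            # stop at the next section header
            if elem.name == 'p':
                texts.append(elem.get_text())
        sections[header.get_text()] = texts
    return sections

soup = BeautifulSoup(html, 'html.parser')
print(paragraphs_per_header(soup))
# {'Section A': ['Alice', 'first', 'second'], 'Section B': ['Bob', 'only']}
```

Because the loop breaks at each new &lt;h1&gt;, it naturally handles one, three, or eight paragraphs per section without knowing the count in advance.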

Answer 1 (score: -1)

With XPath this can be solved like so:

//h1/following-sibling::p

This should give you all the &lt;p&gt; elements that are following siblings of the h1.
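A minimal sketch of this XPath expression using lxml (a separate dependency; BeautifulSoup itself does not evaluate XPath). The sample markup is my own:

```python
from lxml import html

doc = html.fromstring("""
<div>
<h1>Header</h1>
<p>one</p>
<p>two</p>
<h1>Next</h1>
<p>three</p>
</div>
""")

# The answer's expression: every <p> that is a following sibling of an <h1>.
paras = doc.xpath('//h1/following-sibling::p')
print([p.text for p in paras])
# ['one', 'two', 'three']
```

Note a caveat with this expression: the following-sibling axis does not stop at the next &lt;h1&gt;, so the paragraphs of later sections ('three' above) are included too, unlike the per-section loop in the accepted answer.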