Parsing HTML with BeautifulSoup depending on the preceding tag

Time: 2014-07-26 02:07:29

Tags: python html parsing html-parsing beautifulsoup

I have an HTML document in which some marked-up text follows each of several titles. Like this:

<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>

<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>

<h1>Title 3</h1>
<p>Some text</p>
<p>Some other <i>text</i></p>

(The only thing that is fixed is the number of titles; everything else can change.)

How can I use BeautifulSoup to extract all the HTML that comes after each title but before the next one?

1 answer:

Answer 0: (score: 1)

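You can pass a regular expression, Title \d+, as the text argument (named string since BeautifulSoup 4.4.0, with text still accepted) to find all the titles, then use find_next_siblings() to get the next two p tags after each one: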

import re
from bs4 import BeautifulSoup

data = """
<div>
    <h1>Title 1</h1>
    <p>Some text</p>
    <p>Some other <b>text</b></p>

    <h1>Title 2</h1>
    <p>Some <b>text</b></p>
    <p>Some text2</p>

    <h1>Title 3</h1>
    <p>Some text</p>
    <p>Some other <i>text</i></p>
</div>
"""

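soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

# find every <h1> whose text matches "Title <number>"
for h1 in soup.find_all('h1', text=re.compile(r'Title \d+')):
    # take the first two <p> siblings that follow the heading
    for p in h1.find_next_siblings('p', limit=2):
        print(p.text.strip())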

Prints:

Some text
Some other text
Some text
Some text2
Some text
Some other text

Alternatively, using a list comprehension:

print([p.text.strip()
       for h1 in soup.find_all('h1', text=re.compile(r'Title \d+'))
       for p in h1.find_next_siblings('p', limit=2)])

Prints:

['Some text', 'Some other text', 'Some text', 'Some text2', 'Some text', 'Some other text']
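
Note that limit=2 relies on every title being followed by exactly two p tags, while the question says that only the number of titles is fixed. If the amount of content under each title varies, one possible approach is to walk the siblings after each h1 and stop at the next h1. The following is only a sketch under some assumptions: the titles and their content are siblings under the same parent, the helper name sections_by_title is made up for illustration, and the sample data has been altered so that the sections have different lengths. It keeps the markup of each element (via str()) rather than only the text, since the question asks for the HTML:

```python
from bs4 import BeautifulSoup

# Sample data modified from the question so that section lengths differ.
data = """
<div>
    <h1>Title 1</h1>
    <p>Some text</p>
    <p>Some other <b>text</b></p>

    <h1>Title 2</h1>
    <p>Some <b>text</b></p>

    <h1>Title 3</h1>
    <p>Some text</p>
    <p>Some other <i>text</i></p>
    <p>One more</p>
</div>
"""


def sections_by_title(html):
    """Map each <h1> title to the HTML of the tags between it and the next <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for h1 in soup.find_all("h1"):
        parts = []
        # True matches every tag, so this iterates over all following sibling tags
        for sibling in h1.find_next_siblings(True):
            if sibling.name == "h1":
                break  # reached the next title, so this section is complete
            parts.append(str(sibling))  # str() keeps the inner markup
        sections[h1.get_text(strip=True)] = parts
    return sections


sections = sections_by_title(data)
for title, parts in sections.items():
    print(title, parts)
```

Each title maps to a list of HTML strings, one per element in its section. If the titles are not direct siblings of their content (for example, each is wrapped in its own container), the sibling walk would need to be adapted.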