I have an HTML document in which each title is followed by some text with markup, like this:
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>
<h1>Title 3</h1>
<p>Some text</p>
<p>Some other <i>text</i></p>
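(The only thing that is fixed is the number of titles; everything else can change.)
How can I use BeautifulSoup to extract all the HTML that comes after each title but before the next one?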
Answer 0 (score: 1)
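You can pass a regular expression, Title \d+, as the text argument to find all the titles (Beautiful Soup 4.4.0 and later call this argument string), then use find_next_siblings() to get the next two p tags: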
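import re
from bs4 import BeautifulSoup

data = """
<div>
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>
<h1>Title 3</h1>
<p>Some text</p>
<p>Some other <i>text</i></p>
</div>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')

# Find every <h1> whose text matches "Title <number>".
for h1 in soup.find_all('h1', text=re.compile(r'Title \d+')):
    # Take the next two <p> siblings that follow this title.
    for p in h1.find_next_siblings('p', limit=2):
        print(p.text.strip())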
This prints:
Some text
Some other text
Some text
Some text2
Some text
Some other text
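Since the question asks for the HTML rather than the plain text, the same lookup can keep the markup by converting each tag with str() instead of reading .text. The following is only a minimal sketch under a few assumptions: the sections dictionary is a name introduced here for illustration, the sample is shortened to two titles, and string= is used as the newer spelling of the text argument.

```python
import re
from bs4 import BeautifulSoup

data = """
<div>
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Map each title to the markup of the two <p> tags that follow it.
# `sections` is an illustrative name, not part of the original answer.
sections = {}
for h1 in soup.find_all("h1", string=re.compile(r"Title \d+")):
    sections[h1.get_text(strip=True)] = [
        str(p) for p in h1.find_next_siblings("p", limit=2)
    ]

for title, fragments in sections.items():
    print(title, fragments)
```

str(p) serializes the tag together with its children, so inline elements such as <b> are preserved.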
Alternatively, use a list comprehension:
print([p.text.strip()
       for h1 in soup.find_all('h1', text=re.compile(r'Title \d+'))
       for p in h1.find_next_siblings('p', limit=2)])
This prints:
['Some text', 'Some other text', 'Some text', 'Some text2', 'Some text', 'Some other text']
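The limit=2 approach relies on every title being followed by exactly two paragraphs. If that count can vary, one possible variation is to walk the siblings after each title and stop at the next h1. This is a sketch rather than part of the original answer: html_until_next_heading and its stop_tag parameter are hypothetical names, and the sample data is altered so the sections have different lengths.

```python
from bs4 import BeautifulSoup, Tag

data = """
<div>
<h1>Title 1</h1>
<p>Some text</p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>
<p>One more</p>
</div>
"""


def html_until_next_heading(heading, stop_tag="h1"):
    """Return the markup of the sibling tags after `heading`, up to the next `stop_tag`."""
    parts = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip bare text nodes (here, the newlines between tags)
        if sibling.name == stop_tag:
            break  # the next title starts a new section
        parts.append(str(sibling))
    return parts


soup = BeautifulSoup(data, "html.parser")
result = {
    h1.get_text(strip=True): html_until_next_heading(h1)
    for h1 in soup.find_all("h1")
}

for title, fragments in result.items():
    print(title, fragments)
```

Because the loop breaks at the next heading, each title collects only its own section, however many elements it contains.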