This question assumes the following background:
The best way to explain is with an example:
<h2>Alpha blurb</h2>
* content here one
* content here two
<h2>Bravo blurb</h2>
* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks
<h2>Charlie blurb</h2>
* content here four
* content here fyve
* content here seeks
<h2>Delta blurb</h2>
* blah
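From what Trevor has seen so far, BeautifulSoup scrapes content by finding container elements, iterating over them, and drilling down into them.
In this case, however, Trevor wants to extract each header item together with its associated content, even though that content is not enclosed in a containing element.
The only indication of where one content section ends and the next begins is the placement of the header tags.
Where in the BeautifulSoup 4 documentation should Trevor search, or which terms can he look up, to find this principle and get the result he wants?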
Answer 0 (score: 1)
Find every h2 element and iterate over its .next_siblings until the next h2 is reached. For example:
from bs4 import BeautifulSoup
page = """
<div>
<h2>Alpha blurb</h2>
* content here one
* content here two
<h2>Bravo blurb</h2>
* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks
<h2>Charlie blurb</h2>
* content here four
* content here fyve
* content here seeks
<h2>Delta blurb</h2>
* blah
</div>
"""
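# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(page, "html.parser")
for h2 in soup.find_all("h2"):
    print(h2.text)
    # loop over siblings until the next h2 is met (or no siblings are left)
    for item in h2.next_siblings:
        if item.name == "h2":
            break
        # in this sample every other sibling is a text node, so strip() works
        print(item.strip())
    print("----")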
Prints:
Alpha blurb
* content here one
* content here two
----
Bravo blurb
* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks
----
Charlie blurb
* content here four
* content here fyve
* content here seeks
----
Delta blurb
* blah
----
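The loop above calls strip() on every sibling, which only works while the content between headers is plain text. A minimal sketch of a variant that collects each section into a dict and also tolerates tag siblings (such as a p element) is shown here. The function name sections_by_heading and the shortened sample page are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag


def sections_by_heading(html, heading="h2"):
    """Map each heading's text to the non-empty content lines that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for header in soup.find_all(heading):
        lines = []
        for sibling in header.next_siblings:
            if isinstance(sibling, Tag):
                if sibling.name == heading:
                    break  # the next section starts here
                text = sibling.get_text()
            else:
                text = str(sibling)  # plain text node between tags
            # keep only lines that still contain something after trimming
            lines.extend(ln.strip() for ln in text.splitlines() if ln.strip())
        sections[header.get_text(strip=True)] = lines
    return sections


# Hypothetical sample: same shape as the page above, with one line wrapped in <p>.
page = """
<div>
<h2>Alpha blurb</h2>
* content here one
<p>* content here two</p>
<h2>Delta blurb</h2>
* blah
</div>
"""

result = sections_by_heading(page)
for title, lines in result.items():
    print(title, lines)
```

Because the result is keyed by heading text, two headers with identical text would overwrite each other; a list of (title, lines) pairs would avoid that if duplicates are expected.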