I have a page that contains several repeated blocks of the form <div...><h4>...<p>...
For example:
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
If I write
print(soup.select('div[class^="proletariat"] > h4 ~ p'))
I get:
[<p>Ignore this text</p>, <p>This is the text we want</p>]
How can I specify that I only want the text of the <p> that is preceded by <h4>hammer</h4>?
Thanks
Answer 0 (score: 1)
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
import re
from bs4 import BeautifulSoup
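soup = BeautifulSoup(html, "html.parser")
# Find the <h4> whose text matches "hammer", then take its next <p> sibling.
# find_next_sibling("p") skips the whitespace text node between the two tags.
print(soup.find("h4", string=re.compile("hammer")).find_next_sibling("p").text)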
This is the text we want
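The one-liner above raises AttributeError when no <h4> matches, because find() returns None. A slightly more defensive sketch follows; the helper name text_after_heading is hypothetical and not part of BeautifulSoup, and it matches the heading text exactly rather than by regular expression.

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''


def text_after_heading(soup, heading_text):
    """Return the text of the first <p> sibling after the <h4> whose
    string equals heading_text, or None if either tag is missing.

    Hypothetical helper written for this example.
    """
    h4 = soup.find("h4", string=heading_text)
    if h4 is None:
        return None
    p = h4.find_next_sibling("p")
    return p.text if p is not None else None


soup = BeautifulSoup(html, "html.parser")
print(text_after_heading(soup, "hammer"))  # This is the text we want
print(text_after_heading(soup, "anvil"))   # None
```

Returning None instead of raising lets the caller decide what to do when a block is missing from the page.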
Answer 1 (score: 1)
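The :contains() pseudo-class would help here, but select() did not support it at the time of writing.
With that in mind, you can combine select() with find_next_sibling():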
# Take the first <h4> whose text is exactly "hammer" and print its next <p> sibling.
print(next(h4.find_next_sibling('p').text
           for h4 in soup.select('div[class^="proletariat"] > h4')
           if h4.text == "hammer"))
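For completeness: newer BeautifulSoup releases delegate select() to the Soup Sieve package, which, to my knowledge, offers a non-standard :-soup-contains() pseudo-class. The sketch below assumes a recent bs4 with a soupsieve version that includes it; older installs will reject the selector. Note that it matches substrings, so a heading like "hammerhead" would match as well.

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Assumption: the installed soupsieve supports :-soup-contains().
# Select every <p> that is a later sibling of an <h4> containing "hammer".
matches = soup.select('div[class^="proletariat"] > h4:-soup-contains("hammer") ~ p')
texts = [p.get_text() for p in matches]
print(texts)  # ['This is the text we want']
```

If an exact match on the heading text is required, the select() plus find_next_sibling() combination above remains the safer choice.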