I want to use BeautifulSoup to extract the text from paragraph elements. The HTML looks like this:
<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
</span>
I want to extract the text from the first p only when an h1 precedes it, and likewise for each following h1. So far I have tried:
x=soup.findAll('span',{'class':'span_class'})
y=x.findAll('p')[0].text
But this does not give me the result I want.
Answer 0 (score: 0)
You can use the CSS adjacent sibling selector here:
paragraphs = soup.select('span.span_class h1 + p')
# `paragraphs` now contains two elements: <p>para1</p> and <p>para2</p>
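This selects only those P elements whose immediately preceding sibling element is an H1. If you want to apply more logic based on the H1 content, you can do this:
for p in soup.select('span.span_class h1:first-child + p'):
    # `p` is the element that has an `H1` directly before it.
    # `find_previous_sibling('h1')` returns that `H1`; plain `p.previous_sibling`
    # may be a whitespace text node when the markup contains line breaks.
    h1 = p.find_previous_sibling('h1')
    if h1.text == 'heading1':
        # We got the `P` that has `H1` with content `"heading1"` before it.
        print(p, h1)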
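For reference, here is a minimal self-contained sketch of the sibling-selector approach, run against the markup from the question. It assumes the built-in `html.parser` backend and uses `find()` to get a single tag, since `find_all()` returns a list-like `ResultSet` that has no `find_all()` or `select()` of its own (which is why the attempt in the question fails). The names `span` and `pairs` are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag, so select() can be called on it directly.
span = soup.find("span", class_="span_class")

# "h1 + p" matches a <p> whose immediately preceding element sibling is an <h1>.
pairs = []
for p in span.select("h1 + p"):
    h1 = p.find_previous_sibling("h1")  # nearest <h1> before this <p>
    pairs.append((h1.get_text(strip=True), p.get_text(strip=True)))

print(pairs)
# [('heading1', 'para1'), ('heading 2', 'para2')]
```

Collecting (heading, paragraph) pairs makes it easy to filter on the heading text afterwards instead of hard-coding one heading in the loop.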
Answer 1 (score: 0)
from bs4 import BeautifulSoup as bs

html = '''<html>
<body>
<span class='span_class'>
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<p>content3</p>
</span>
</body>
</html>'''
soup = bs(html, 'lxml')
x = soup.find_all('span', {'class': 'span_class'})  # find every matching <span>
for y in x:
    for heading in y.find_all('h1'):  # loop over each <h1> inside the <span>
        if heading.text == 'heading1':
            print(heading.text)  # print the <h1> text
            # find_next() returns None instead of raising when nothing follows
            p = heading.find_next('p')
            if p is not None:
                print(p)
        # any <h1> whose text is not 'heading1' is skipped
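Is this what you are looking for? It finds each <span> and then looks for <h1> elements inside it. If an <h1> with the text 'heading1' is found in the <span>, it looks up the next <p>.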
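One caveat: `find_next('p')` returns the next <p> anywhere later in the document, even if other elements sit between it and the <h1>. If the paragraph must directly follow the heading, a variant along these lines may be closer to what the question asks. This is only a sketch: the helper name `first_paragraph_after` is hypothetical, and it assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup


def first_paragraph_after(html, heading_text):
    """Return the text of the <p> directly following the matching <h1>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.find_all("span", class_="span_class"):
        for h1 in span.find_all("h1"):
            if h1.get_text(strip=True) != heading_text:
                continue
            # True matches any tag, so whitespace text nodes are skipped.
            nxt = h1.find_next_sibling(True)
            if nxt is not None and nxt.name == "p":
                return nxt.get_text(strip=True)
    return None


sample = """
<span class='span_class'>
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<p>content3</p>
</span>
"""

print(first_paragraph_after(sample, "heading1"))  # content1
print(first_paragraph_after(sample, "heading2"))  # content3
print(first_paragraph_after(sample, "missing"))   # None
```

Returning `None` when the heading is absent, or when it is not immediately followed by a <p>, keeps the "only if h1 exists" condition explicit for the caller.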