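I want to extract the text from a div with a specific class. The div contains the text I need plus an extra span tag, with its own class, whose text I don't want. How can I get the text from the div while ignoring the text inside the span?
The tree looks like this: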
<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>
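What I have now:
desc = source.find('div', class_="desc").text
This returns the full text, including the text of the span. I tried decompose(), text=True, and recursive=False, but none of them worked. Does anyone know how to solve this?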
Answer 0 (score: 0)
.extract() should do the trick:
html = '''<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>'''
import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
# Detach the unwanted span from the tree before reading the div's text
soup.find('span', class_='some_class').extract()
desc = soup.find('div', class_="desc").text
Output:
print(desc)
Some desc:Some
title
- text
- text
text
text
@
text
text
text
text
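As a side note on the call used above: extract() detaches the tag from the tree and returns it, so the removed content is still available if needed. A minimal sketch, assuming a shortened stand-in for the question's HTML (the "keep"/"drop" strings are made up for illustration):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question
html = '<div class="desc">keep <span class="some_class">drop</span> this</div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="desc")
# extract() removes the span from the tree and hands it back
removed = div.find("span", class_="some_class").extract()

removed_text = removed.get_text()   # text of the detached span
remaining_text = div.get_text()     # div text without the span
print(repr(removed_text))
print(repr(remaining_text))
```

The two text nodes on either side of the removed span are not merged, so the remaining text keeps both of their spaces.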
Answer 1 (score: 0)
Find the span tag and decompose it.
from bs4 import BeautifulSoup
data='''<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>'''
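soup = BeautifulSoup(data, 'html.parser')
# Locate the span inside the target div and destroy it
item = soup.find('div', class_='desc').find('span')
item.decompose()
# Read the div's text now that the span is gone
newitem = soup.find('div', class_='desc')
print(newitem.text)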
Output:
Some desc:Some
title
- text
- text
text
text
@
text
text
text
text
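Both answers remove only the first span that find() returns. If the div could contain several unwanted spans, a loop over find_all() handles all of them. This is a minimal sketch, assuming a shortened, made-up HTML string and a hypothetical helper name (text_without_spans); get_text() with a separator and strip=True is used here only to tidy the whitespace left behind by the <br/> tags.

```python
from bs4 import BeautifulSoup

def text_without_spans(html, div_class, span_class):
    """Return the div's text after removing every span with span_class.

    Hypothetical helper for illustration; it parses its own copy of the
    HTML, so the caller's tree (if any) is not modified.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=div_class)
    # find_all() returns a list, so decomposing while looping is safe
    for span in div.find_all("span", class_=span_class):
        span.decompose()
    # Strip each text fragment and join the non-empty ones with a space
    return div.get_text(" ", strip=True)

# Made-up sample with two unwanted spans
sample = (
    '<div class="desc">a <span class="some_class">X</span> b<br/>'
    'c <span class="some_class">Y</span></div>'
)
result = text_without_spans(sample, "desc", "some_class")
print(result)  # a b c
```

Passing class_ to find_all() keeps any other spans in the div intact, which matters if only spans of one class should be dropped.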