How do I skip the text inside a specific tag?

Time: 2019-06-03 10:49:26

Tags: python web-scraping beautifulsoup

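I want to extract the text from a div with a specific class. Inside this div there is text I don't want, wrapped in an extra span tag with its own class. How can I get the text of the div while ignoring the text inside the span?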

The tree looks like this:

<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some 
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>

Right now I have:

desc = source.find('div', class_="desc").text 

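This returns the full text, including the text inside the span. I tried decompose(), text=True, and recursive=False, but none of them worked. Does anyone know how to solve this?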

2 answers:

Answer 0 (score: 0)

.extract() should do the trick:

html = '''<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some 
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>'''

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

# Remove the unwanted span from the tree, then read the remaining text
soup.find('span').extract()
desc = soup.find('div', class_="desc").text

Output:

print(desc)
Some desc:Some 
title 
- text 
- text

text 


text 
@ 

text  

text 
text 
text 
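
If the span needs to stay in the tree for later processing, one alternative is to collect the div's text nodes and skip the ones that sit inside the unwanted span. This is only a minimal sketch: the helper name text_without is hypothetical, the HTML is a shortened stand-in for the markup in the question, and it assumes a Beautiful Soup version that accepts the string= argument.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question
html = ('<div class="desc"><h3>Some desc:</h3>text '
        '<span class="some_class">TEXT WHICH I DONT WANT</span> <br/>'
        'more text</div>')

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='desc')


def text_without(tag, unwanted_class):
    """Join the text nodes under `tag`, skipping any inside a span with `unwanted_class`."""
    parts = []
    for node in tag.find_all(string=True):
        # find_parent walks up the ancestors; None means no matching span encloses this node
        if node.find_parent('span', class_=unwanted_class) is None:
            parts.append(node)
    return ''.join(parts)


desc = text_without(div, 'some_class')
print(desc)
```

Unlike .extract(), this leaves soup unchanged, so the span can still be found afterwards. Note that find_parent also looks at ancestors above the div, which does not matter for the markup shown here.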

Answer 1 (score: 0)

Find the span tag and decompose it.

from bs4 import BeautifulSoup

data='''<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some 
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>'''
soup = BeautifulSoup(data, 'html.parser')
# Locate the span inside the target div and remove it from the tree
item = soup.find('div', class_='desc').find('span')
item.decompose()
newitem = soup.find('div', class_='desc')
print(newitem.text)

Output:

Some desc:Some 
title 
- text 
- text

text 


text 
@ 

text  

text 
text 
text
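
If the div can contain more than one unwanted span, the same idea can be applied in a loop. The following is a sketch under assumptions: the HTML is a made-up, shortened variant with two spans, and the use of get_text with a separator and strip=True to tidy the leftover whitespace is an optional extra, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Made-up variant of the question's markup with two unwanted spans
data = ('<div class="desc"><h3>Some desc:</h3>a '
        '<span class="some_class">X</span> <br/>b '
        '<span class="some_class">Y</span></div>')

soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='desc')

# decompose() removes the tag in place and returns None,
# so call it for its side effect and do not use its return value
for span in div.find_all('span', class_='some_class'):
    span.decompose()

# Strip each text fragment and join the non-empty ones with a single space
clean = div.get_text(' ', strip=True)
print(clean)
```

Restricting find_all to class_='some_class' leaves any other spans inside the div untouched.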