Question

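I want to extract the text from a div with a specific class. Inside this div there is text I don't want, wrapped in an extra span tag with its own class. How can I get the text of the div while ignoring the text inside the span?

The tree looks like this: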

<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some 
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>

Right now I have:

desc = source.find('div', class_="desc").text

and that returns the full text, including the span. I tried decompose(), text=True, and recursive=False, but none of them worked. Does anyone know how to solve this?

Answer 1

.extract() should do the trick:

html = '''<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some 
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>'''

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

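# extract() removes the first span from the tree and returns the removed tag
soup.find('span').extract()
# the div's text no longer includes the span's contents
desc = soup.find('div', class_="desc").text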

Output:

print(desc)
Some desc:Some 
title 
- text 
- text

text 


text 
@ 

text  

text 
text 
text
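If the div holds more than one unwanted span, or other spans that should stay, the same idea can be scoped to the target div and to the span's class. The following is a minimal sketch under those assumptions; the shortened `html` string and the `other` class are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: two unwanted spans and one span to keep.
html = ('<div class="desc"><h3>Some desc:</h3>keep '
        '<span class="some_class">drop A</span> middle '
        '<span class="some_class">drop B</span> end '
        '<span class="other">stay</span></div>')

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='desc')

# find_all() returns a list-like ResultSet, so removing tags while looping is safe.
for span in div.find_all('span', class_='some_class'):
    span.decompose()

desc = div.get_text()
print(desc)
```

Note that decompose() returns None, so it has to be called as a statement of its own before reading the text; chaining `.text` onto its result raises an AttributeError.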

Answer 2

Find the span tag and decompose it.

from bs4 import BeautifulSoup

data='''<div class="desc"><h3 class="text-15 margin-bottom-10">Some desc:</h3>Some 
title <br/>
- text <br/>
- text<br/>
<br/>
text <br/>
<br/>
<br/>
text <br/>
@ <br/>
<br/>
text <span class="some_class">TEXT WHICH I DONT WANT</span> <br/>
<br/>
text <br/>
text <br/>
text </div>'''
soup = BeautifulSoup(data, 'html.parser')
# locate the unwanted span inside the target div and remove it in place
item = soup.find('div', class_='desc').find('span')
item.decompose()
newitem = soup.find('div', class_='desc')
print(newitem.text)

Output:

Some desc:Some 
title 
- text 
- text

text 


text 
@ 

text  

text 
text 
text
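Both extract() and decompose() modify the parsed tree. If the tree has to stay intact, one possible alternative is to collect the text nodes and skip those that sit inside the unwanted span. This is only a sketch: the helper name `text_without` and the shortened `html` string are hypothetical, not part of the answers above.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup standing in for the question's div.
html = ('<div class="desc"><h3>Some desc:</h3>keep '
        '<span class="some_class">drop</span> end</div>')

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='desc')


def text_without(tag, name, class_name):
    """Join the text nodes under `tag`, skipping any inside <name class=class_name>."""
    parts = []
    for string in tag.find_all(string=True):
        # find_parent() walks up the ancestors; None means no matching ancestor.
        if string.find_parent(name, class_=class_name) is None:
            parts.append(string)
    return ''.join(parts)


desc = text_without(div, 'span', 'some_class')
print(desc)
```

The span is still present in `soup` afterwards, so the same parsed document can be reused for other lookups.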

How to skip text from a specific tag?
