I have the following HTML:
<div class="leftColumn">
<div>
<div class="static">
.............................
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
.........................
</div>
</div>
I just saw that one way to get the text is:
soup.select('.leftColumn div')[0].text.split()
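This works, but it leaves a lot of junk from the two nested divs, which makes it hard to find the text I need. Is there a way to remove the divs with the two classes (static and summary), so that the remainder is easier to process?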
Answer 0 (score: 2)
Here is an example based on your snippet:
from bs4 import BeautifulSoup
text = """
<div class="leftColumn">
<div>
<div class="static">
.............................
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
.........................
</div>
</div>
</div>
"""
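# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')
# Find divs with class "static" or "summary" and remove them using `extract`
div_nodes = soup.find_all('div', {'class': ['static', 'summary']})
for div in div_nodes:
    div.extract()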
print(soup.text.split())
If you run the code, you will see that the static and summary divs have been removed, and you get:
['text1', 'text2', '(222)', '123', '-', '4567']
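If you would rather keep each line whole (for example, the phone number as a single string), one possible variation is sketched below. It assumes a reasonably recent Beautiful Soup 4 where `select`, `select_one`, `decompose` and `stripped_strings` are available. The "static junk" and "summary junk" strings are placeholders I made up to stand in for the elided content of your two divs.

```python
from bs4 import BeautifulSoup

# Placeholder markup: "static junk" / "summary junk" stand in for the elided div contents.
html = """
<div class="leftColumn">
<div>
<div class="static">static junk</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">summary junk</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove the unwanted blocks in place; decompose() destroys the tag and its contents.
for node in soup.select(".leftColumn div.static, .leftColumn div.summary"):
    node.decompose()

# stripped_strings yields each remaining text node with surrounding whitespace
# removed, so the phone number is not split on its internal spaces.
lines = list(soup.select_one(".leftColumn").stripped_strings)
print(lines)  # ['text1', 'text2', '(222) 123 - 4567']
```

The difference from `extract` is that `decompose` throws the removed tag away instead of returning it, which is fine here because the removed divs are never used again.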