Removing HTML elements with Beautiful Soup

Time: 2014-08-05 19:41:35

Tags: python html beautifulsoup

I have the following HTML:

<div class="leftColumn">
  <div>
     <div class="static">
      .............................
     </div>  
     text1
     <br>
     text2
     <br>
     (222) 123 - 4567
     <br>
     <div class="summary">
     .........................
     </div>
  </div>
</div>

So far, the way I have been getting the text is:

soup.select('.leftColumn div')[0].text.split()

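This works, but it leaves a lot of junk from the two nested divs, which makes it hard to find the text I need. Is there a way to remove the divs with the two classes (static and summary) so that the remainder is easier to process?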

1 answer:

Answer 0 (score: 2)

Here is an example based on your snippet:

from bs4 import BeautifulSoup

text = """
<div class="leftColumn">
  <div>
     <div class="static">
      .............................
     </div>
     text1
     <br>
     text2
     <br>
     (222) 123 - 4567
     <br>
     <div class="summary">
     .........................
     </div>
  </div>
</div>
"""

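# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')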

# Find divs with class "static" or "summary" and remove them using `extract`
div_nodes = soup.find_all('div', {'class': ['static', 'summary']})
for div in div_nodes:
    div.extract()

print(soup.text.split())

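If you run the code, you will see that the static and summary divs are removed, and you get the following (under Python 2; Python 3 prints the same list without the u prefixes):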

[u'text1', u'text2', u'(222)', u'123', u'-', u'4567']
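Splitting on whitespace breaks the phone number into four tokens. If you would rather keep each line of text whole, here is a minimal sketch of a variation. It assumes Python 3 and the built-in `html.parser`, and it swaps `extract` for `decompose`, which removes the tag and discards it instead of returning it. The text inside the two unwanted divs is placeholder content standing in for the elided parts of your snippet, and the variable names `html` and `lines` are only illustrative.

```python
from bs4 import BeautifulSoup

# Placeholder markup: the contents of the "static" and "summary" divs are
# made up here, since the original snippet elides them.
html = """
<div class="leftColumn">
  <div>
     <div class="static">placeholder static content</div>
     text1
     <br>
     text2
     <br>
     (222) 123 - 4567
     <br>
     <div class="summary">placeholder summary content</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# decompose() removes each matching tag from the tree and destroys it
for div in soup.find_all("div", class_=["static", "summary"]):
    div.decompose()

# stripped_strings yields every non-empty text node with surrounding
# whitespace trimmed, so each piece of text stays in one string
left_column = soup.find("div", class_="leftColumn")
lines = list(left_column.stripped_strings)

print(lines)  # ['text1', 'text2', '(222) 123 - 4567']
```

Use `extract` if you still need the removed nodes afterwards. `decompose` is the better fit when they are simply junk.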