How to unwrap tags in Beautiful Soup

Time: 2014-09-09 22:08:31

Tags: python beautifulsoup

I have an HTML document similar to this:

<div>
<h2>Title</h2>
<div>
 <div>
  <div>
   <img alt="Some image" src="blah.gif"/>
  </div>
 </div>
</div>

I want to process it so that it ends up looking like this (i.e. with the empty nested divs removed):

<h2>Title</h2>
<div>
  <img alt="Some image" src="blah.gif"/>
</div>

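I don't mind keeping the outer div if it contains something, but I want to strip out anything that is merely nested.

To clarify: when a div contains another div and nothing else, I want to remove (unwrap) the inner div, i.e. instead of

div > div > div > div > div > img

I just want

div > img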

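1 Answer:

Answer 0: (score: 0)

Here is a proof of concept I wrote; suggestions on the code are welcome.

You can add further conditions to the function test. The loop searches the tree for elements that match the conditions and unwraps the outermost one, repeating until nothing matches.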

from bs4 import BeautifulSoup

# Raw string, so the backslash in the src path is kept literally
mytext = r"""
<div>
<h2>
 At least he didn't go in for the hug.
</h2>
<div>
 <div>
  <div>
   <img alt="At least he didn't go in for the hug." src="handshake-fails-are-embarrassing\9lmzspj.gif"/>
  </div>
 </div>
</div>
"""

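# Name the parser explicitly; lxml wraps the fragment in <html> and <body>
soup = BeautifulSoup(mytext, "lxml")


def test(x):
    # direct child tags only (text nodes are ignored)
    children = x.find_all(recursive=False)
    # only one child
    cri_1 = (len(children) == 1)
    # same name as its child (checked only when there is exactly one child)
    cri_2 = cri_1 and (children[0].name == x.name)
    # no attributes, just the tag name
    cri_3 = (len(x.attrs) == 0)
    return cri_1 and cri_2 and cri_3

# unwrap the first matching element, then search again until none are left
while True:
    element = soup.find(test)
    if element is None:
        break
    element.unwrap()

print(soup.prettify())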

Output:

<html>
 <body>
  <div>
   <h2>
    At least he didn't go in for the hug.
   </h2>
   <div>
    <img alt="At least he didn't go in for the hug." src="handshake-fails-are-embarrassing\9lmzspj.gif"/>
   </div>
  </div>
 </body>
</html>
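
The question asks to unwrap a div only when it contains another div "and nothing else", while test above counts child tags and ignores text nodes. A minimal sketch of a stricter variant follows, assuming the built-in html.parser (so no html/body wrapper is added) and a sample document whose outer div is closed. The names is_redundant_wrapper and collapse_nested are hypothetical and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def is_redundant_wrapper(tag):
    """True if `tag` has no attributes and holds exactly one child tag of the
    same name, with nothing but whitespace around that child."""
    if tag.attrs:
        return False
    child_tags = [c for c in tag.contents if isinstance(c, Tag)]
    if len(child_tags) != 1 or child_tags[0].name != tag.name:
        return False
    # Reject wrappers that also carry real text (or comments) next to the inner tag.
    return all(isinstance(c, Tag) or not str(c).strip() for c in tag.contents)


def collapse_nested(soup):
    """Unwrap redundant wrappers one at a time until none remain."""
    while True:
        wrapper = soup.find(is_redundant_wrapper)
        if wrapper is None:
            return soup
        wrapper.unwrap()


# Sample input modelled on the question, with the outer div closed (an assumption).
html = """
<div>
<h2>Title</h2>
<div>
 <div>
  <div>
   <img alt="Some image" src="blah.gif"/>
  </div>
 </div>
</div>
</div>
"""

soup = collapse_nested(BeautifulSoup(html, "html.parser"))
div_count = len(soup.find_all("div"))
print(soup.prettify())
```

With this check, a div that holds text alongside its inner div is left alone, whereas the original test would unwrap it (the text itself would survive either way, since unwrap() keeps the children).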