I have an HTML document similar to this:
<div>
<h2>Title</h2>
<div>
<div>
<div>
<img alt="Some image" src="blah.gif"/>
</div>
</div>
</div>
I want to extract it so that it ends up looking like this (i.e. with the empty nested divs removed):
<h2>Title</h2>
<div>
<img alt="Some image" src="blah.gif"/>
</div>
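I don't mind keeping the outer div if it contains something, but I want to strip anything that is merely nested.
To clarify: when a div contains another div and nothing else, I want to remove (unwrap) the inner div. That is, instead of
div > div > div > div > div > img
I only want
div > img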
Answer 0 (score: 0)
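Here is a proof of concept I wrote; suggestions on the code are welcome.
You can add more conditions to the function test. The loop repeatedly finds elements that match those conditions and unwraps the outermost match until none remain.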
from bs4 import BeautifulSoup

# Raw string so the backslash in the src attribute is kept literally.
mytext = r"""
<div>
<h2>
At least he didn't go in for the hug.
</h2>
<div>
<div>
<div>
<img alt="At least he didn't go in for the hug." src="handshake-fails-are-embarrassing\9lmzspj.gif"/>
</div>
</div>
</div>
"""

# Name the parser explicitly. lxml (assumed to be installed) adds the
# <html>/<body> wrapper that appears in the output.
soup = BeautifulSoup(mytext, "lxml")

def test(x):
    # Direct child tags only; text nodes are ignored.
    children = x.find_all(recursive=False)
    try:
        # only one child
        cri_1 = (len(children) == 1)
        # same name as its child
        cri_2 = (children[0].name == x.name)
        # no attributes, just the tag name
        cri_3 = (len(x.attrs) == 0)
        return cri_1 and cri_2 and cri_3
    except IndexError:
        # no child tags at all
        return False

# Unwrap one matching element at a time until none are left.
while soup.find_all(test):
    elements = soup.find_all(test)
    elements[0].unwrap()

print(soup.prettify())
Output:
<html>
<body>
<div>
<h2>
At least he didn't go in for the hug.
</h2>
<div>
<img alt="At least he didn't go in for the hug." src="handshake-fails-are-embarrassing\9lmzspj.gif"/>
</div>
</div>
</body>
</html>
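The same idea can be packaged as a reusable function. The following is only a minimal sketch, not a definitive implementation: the names is_redundant_wrapper and collapse_wrappers are hypothetical, and it assumes the built-in "html.parser" backend, which does not add the <html>/<body> wrapper. The sample markup is the question's snippet with the outer div closed.

```python
from bs4 import BeautifulSoup


def is_redundant_wrapper(tag):
    """True if tag has no attributes and its only child tag shares its name."""
    children = tag.find_all(recursive=False)  # direct child tags only
    return (
        len(children) == 1
        and children[0].name == tag.name
        and not tag.attrs
    )


def collapse_wrappers(html):
    """Unwrap redundant same-name wrappers until none remain."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find(is_redundant_wrapper)
    while wrapper is not None:
        wrapper.unwrap()  # replace the wrapper with its contents
        wrapper = soup.find(is_redundant_wrapper)
    return soup


html = (
    '<div><h2>Title</h2><div><div><div>'
    '<img alt="Some image" src="blah.gif"/>'
    '</div></div></div></div>'
)
result = collapse_wrappers(html)
print(result.prettify())
```

Because the length check comes first and short-circuits, no exception handling is needed for tags without children. Extra conditions (for example, only collapsing div elements) can be added to is_redundant_wrapper.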