I am trying to use BeautifulSoup to remove tags that contain no text. For example, I have the following markup:
<p>
<p>
<br/>
</p>
</p>
or
<p>
<br/>
</p>
I have the following function:
@staticmethod
def stripTagWithNoText(soup, tagname, **kwargs):
    """Strip tags with no text"""
    # Make sure that soup and tags were defined
    assert isinstance(tagname, str)
    # Remove tags with no text
    for tag in soup.find_all(tagname):
        if tag.string:
            continue
        for subtag in tag.findChildren():
            if subtag.string:
                break
            else:
                continue
        tag.extract()
However, this also removes the following tag:
<p>This is some random text</p>
Can anyone spot what is wrong with this?
Also, suppose the following is appended to the end of my HTML:
<p><br />
</p><p><br />
</p><p><br />
</p><p><br />
</p><p><br />
</p><p><br />
</p>
Is there some way to strip all of the whitespace at the end of the HTML, similar to string_text.strip()?
Note: I am using Python 3 and bs4.
Answer 0 (score: 0)
Does this work for you?
from bs4 import BeautifulSoup
from bs4.element import Tag


def main():
    test = """
<p>
this should not be here
<p>this should not be here
<br/>this should not be here
</p>
this should not be here
</p>
"""
    soup = BeautifulSoup(test, 'html.parser')

    def stripTagWithNoText(soup, tagname):
        def remove(node):
            # Walk a copy of the children: recurse into tags, blank out text nodes
            for item in list(node.contents):
                if isinstance(item, Tag):
                    remove(item)
                else:
                    # replace_with keeps the tree links consistent
                    item.replace_with('')

        # Blank the text inside every matching tag
        for tag in soup.find_all(tagname):
            remove(tag)
        print(soup)

    stripTagWithNoText(soup, 'p')
    return 0


if __name__ == '__main__':
    main()
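A possible alternative, offered as a sketch rather than a definitive fix: `tag.string` is `None` whenever a tag has more than one child, so a paragraph with mixed content (text plus an inline tag, for instance) looks "empty" to a `.string` check. Testing `tag.get_text(strip=True)` instead looks at all descendant text. The helper names `strip_empty_tags` and `strip_trailing_empty` below are hypothetical, and the second one assumes the trailing empty paragraphs are direct children of the node you pass in (for a full document that might be `soup.body` rather than `soup`).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def strip_empty_tags(soup, tagname):
    """Remove every <tagname> whose subtree holds no non-whitespace text."""
    for tag in soup.find_all(tagname):
        # get_text(strip=True) gathers all descendant strings and
        # drops the whitespace-only ones, so "" means "no real text".
        if not tag.get_text(strip=True):
            tag.extract()
    return soup


def strip_trailing_empty(root, tagname="p"):
    """Rough analogue of str.rstrip(): drop trailing whitespace strings
    and trailing empty <tagname> tags from root's direct children."""
    while root.contents:
        last = root.contents[-1]
        if isinstance(last, Tag):
            # Stop at the first tag that is not an empty <tagname>.
            if last.name != tagname or last.get_text(strip=True):
                break
        elif str(last).strip():
            # Stop at the first text node with real content.
            break
        last.extract()
    return root


if __name__ == "__main__":
    html = (
        "<p>keep <b>this</b> text</p><p>\n<br/>\n</p><p>tail</p>"
        "<p><br />\n</p><p><br />\n</p>\n"
    )
    soup = BeautifulSoup(html, "html.parser")
    strip_trailing_empty(soup)   # trims only the end of the document
    strip_empty_tags(soup, "p")  # removes empty <p> tags anywhere
    print(soup)
```

Note that a paragraph holding only an image or another non-text element also counts as empty under this test, so the check may need tightening if such tags should be kept.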