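After running BeautifulSoup's prettify, I want to remove the line breaks
and indentation it puts around span and other inline tags.
For example, I currently have something like this: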
>>> import bs4
>>> html = "<div><p>I don't want this <span>span element</span> on it's one line.</p></div>"
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> soup.prettify()
"<div>\n <p>\n  I don't want this\n  <span>\n   span element\n  </span>\n  on it's one line.\n </p>\n</div>"
>>> print(soup.prettify())
<div>
 <p>
  I don't want this
  <span>
   span element
  </span>
  on it's one line.
 </p>
</div>
What regular expression can I use to remove the indentation and line breaks around the span tags, so that I end up with this result:
<div>
 <p>
  I don't want this <span>span element</span> on it's one line.
 </p>
</div>
Answer 0 (score: 0)
Try this:
import re
import bs4
html = '''
<div>
<p>
I don't want this
<span>
span element
</span>
on it's one line.
</p>
</div>
'''
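# name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(html, "html.parser")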
## getting prettified output
html = soup.prettify()
# removing \n and space before and after <span> tag
html = re.sub('[ \n]+<span>[ \n]+','<span>', html)
# removing \n and space before and after </span> tag
html = re.sub('[ \n]+</span>[ \n]+','</span>', html)
Running print(html) gives the following output:
<div>
 <p>
  I don't want this<span>span element</span>on it's one line.
 </p>
</div>
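Note that these two substitutions also swallow the space between the text and the tag, so the result reads "this<span>" rather than "this <span>" as asked in the question. Below is a minimal sketch of a variant that leaves a single space outside the tag instead. The function name inline_span and the hard-coded sample string (the prettified markup from the question, assuming one-space indentation) are illustrative assumptions, and bs4 is left out so the snippet runs on its own:

```python
import re

def inline_span(pretty_html):
    """Pull <span>...</span> back onto the surrounding line, keeping one space outside the tag."""
    # Whitespace before <span> becomes one space; whitespace after it is dropped.
    pretty_html = re.sub(r"\s+<span>\s+", " <span>", pretty_html)
    # Whitespace before </span> is dropped; whitespace after it becomes one space.
    pretty_html = re.sub(r"\s+</span>\s+", "</span> ", pretty_html)
    return pretty_html

# Sample input: prettified markup from the question (assumed one-space indentation).
pretty = (
    "<div>\n"
    " <p>\n"
    "  I don't want this\n"
    "  <span>\n"
    "   span element\n"
    "  </span>\n"
    "  on it's one line.\n"
    " </p>\n"
    "</div>"
)
print(inline_span(pretty))
```

This always inserts a space on the outside of the tag, so it will also add one when the span is the first or last child of its parent, or when the original markup had no space there (for example before punctuation). Whether that is acceptable depends on your markup.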
You can also wrap the two substitutions in a function that handles other tags:
import re
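def prettify_output(html, tag):
    # collapse spaces and newlines around the opening tag
    html = re.sub(f'[ \n]+<{tag}>[ \n]+', f'<{tag}>', html)
    # collapse spaces and newlines around the closing tag
    html = re.sub(f'[ \n]+</{tag}>[ \n]+', f'</{tag}>', html)
    return html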
## call
html = prettify_output(html, 'span')
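The function above only matches opening tags without attributes and inserts the tag name into the pattern unescaped. A possible extension, offered as a sketch rather than a definitive solution, takes several tag names, escapes them with re.escape, allows attributes on the opening tag, and keeps one space outside each tag. The name inline_tags and the sample markup are assumptions for illustration:

```python
import re

def inline_tags(pretty_html, tags):
    """Collapse line breaks around the given inline tags in prettified HTML."""
    for tag in tags:
        name = re.escape(tag)  # guard against regex metacharacters in the tag name
        # Opening tag, optionally with attributes: one space before, none after.
        pretty_html = re.sub(rf"\s+(<{name}(?:\s[^>]*)?>)\s+", r" \1", pretty_html)
        # Closing tag: no whitespace before, one space after.
        pretty_html = re.sub(rf"\s+(</{name}>)\s+", r"\1 ", pretty_html)
    return pretty_html

# Hypothetical prettified input with two inline tags, one of them carrying an attribute.
sample = (
    "<p>\n"
    " I like\n"
    ' <b class="x">\n'
    "  bold\n"
    " </b>\n"
    " and\n"
    " <i>\n"
    "  italic\n"
    " </i>\n"
    " text.\n"
    "</p>"
)
print(inline_tags(sample, ["b", "i"]))
```

Because the pattern requires either ">" or whitespace right after the tag name, passing "i" does not touch tags such as img. Like any regex over HTML, this is fragile: it does not know about whitespace-sensitive elements such as pre, so it is best applied only to output you control.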