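Removing newlines and spaces around span elements with a Python regex

Time: 2019-05-03 01:22:24

Tags: python beautifulsoup python-regex

After running BeautifulSoup's prettify function, I want to remove the newlines and indentation it adds around span and other inline tags.

For example, I currently have something like this: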

>>> import bs4
>>> html = "<div><p>I don't want this <span>span element</span> on it's one line.</p></div>"
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> soup.prettify()
"<div>\n <p>\n  I don't want this\n  <span>\n   span element\n  </span>\n  on it's one line.\n </p>\n</div>"
>>> print(soup.prettify())
<div>
 <p>
  I don't want this
  <span>
   span element
  </span>
  on it's one line.
 </p>
</div>

What regex can I use to remove the indentation and newlines around the span tag, so that I end up with this result:

<div>
 <p>
  I don't want this <span>span element</span> on it's one line.
 </p>
</div>

1 answer:

Answer 0 (score: 0)

Try this:

import re
import bs4

html = '''
    <div>
        <p>
            I don't want this
            <span>
                span element
            </span>
            on it's one line.
        </p>
    </div>
'''

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(html, "html.parser")

# Get the prettified output
html = soup.prettify()

# Remove newlines and spaces before and after the <span> tag
html = re.sub(r'[ \n]+<span>[ \n]+', '<span>', html)

# Remove newlines and spaces before and after the </span> tag
html = re.sub(r'[ \n]+</span>[ \n]+', '</span>', html)

Running print(html) gives the following output:

<div>
 <p>
  I don't want this<span>span element</span>on it's one line.
 </p>
</div>
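
The output above runs the text straight into the tags ("this<span>" and "</span>on"), while the question asked for a single space on either side of the span. A minimal sketch of one way to keep that space is shown below. It assumes the span sits between two pieces of text, as in the example; the variable names `prettified` and `collapsed` are only illustrative.

```python
import re

# The prettified markup from the question, written out as a literal
prettified = (
    "<div>\n <p>\n  I don't want this\n  <span>\n   span element\n  </span>\n"
    "  on it's one line.\n </p>\n</div>"
)

# Opening tag: collapse the whitespace before it to one space, drop the whitespace after it
collapsed = re.sub(r"\s+<span>\s+", " <span>", prettified)

# Closing tag: drop the whitespace before it, collapse the whitespace after it to one space
collapsed = re.sub(r"\s+</span>\s+", "</span> ", collapsed)

print(collapsed)
```

If the span were the first or last child of its parent, these patterns would also pull the parent's tag onto the same line, so this is best treated as a starting point rather than a general solution.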

You can wrap this in a function so it works for different tags:

import re

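def prettify_output(html, tag):
    # Only matches tags written without attributes, e.g. <span> but not <span class="x">
    tag = re.escape(tag)
    # Remove newlines and spaces around the opening tag
    html = re.sub(f'[ \n]+<{tag}>[ \n]+', f'<{tag}>', html)
    # Remove newlines and spaces around the closing tag
    html = re.sub(f'[ \n]+</{tag}>[ \n]+', f'</{tag}>', html)
    return html

# Call it on the prettified markup
html = prettify_output(html, 'span')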
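
The function above handles one tag per call and only matches tags without attributes. As a rough sketch of how it might be extended, the hypothetical helper `collapse_inline` below takes a tuple of inline tag names, allows attributes on the opening tag, and keeps a single space between the tag and the surrounding text. The default tag list and the sample markup are assumptions made for illustration.

```python
import re

def collapse_inline(html, tags=("span", "a", "b", "em", "i", "strong")):
    """Pull the listed inline tags back onto the line of the surrounding text."""
    names = "|".join(re.escape(t) for t in tags)
    # Opening tag, optionally with attributes: keep one space before it, none after it
    html = re.sub(rf"\s+(<(?:{names})(?:\s[^<>]*)?>)\s+", r" \1", html)
    # Closing tag: no space before it, keep one space after it
    html = re.sub(rf"\s+(</(?:{names})>)\s+", r"\1 ", html)
    return html

# Sample markup laid out the way prettify() would indent it
sample = (
    '<p>\n See\n <a href="/docs">\n  the docs\n </a>\n for\n'
    ' <em>\n  details\n </em>\n here.\n</p>'
)
result = collapse_inline(sample)
print(result)
```

Inline tags nested directly inside one another are not fully collapsed by these two passes, because the whitespace between two adjacent tags is consumed by the first match. For markup like that, walking the parsed tree is likely to be more reliable than a regex.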