Selecting <p> tags separated by \n

Time: 2017-02-11 18:15:35

Tags: python regex

I'm using Python to parse and clean up an HTML document, but its formatting is terrible. For example:

<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>

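I want to convert <p>\n<p> into <p>, but I can't seem to match an arbitrary amount of whitespace, such as \n, between the <p> tags.

What I've tried so far: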

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))

However, this fails.
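
One likely cause of the failure is the misplaced closing parenthesis: the replacement and the input string end up as arguments to re.compile instead of re.sub, and re.compile accepts only a pattern and optional flags. The sketch below reproduces that error and shows the call with the arguments in the right place. The sample string is shortened for illustration, and `cleaned` and `original_call_failed` are names made up for this example. This variant only merges the opening tags; the duplicated closing tag and the indentation are left alone.

```python
import re

html = "<p>\n<p>\n    Python initially inherited its parsing from C.</p>\n</p>"

# The call from the question: the closing parenthesis of re.compile is
# misplaced, so re.compile receives three arguments and raises TypeError.
try:
    re.sub(re.compile("<p>\\n+<p>", "<p>", html))
    original_call_failed = False
except TypeError:
    original_call_failed = True

# Pattern, replacement and input passed to re.sub as separate arguments.
# \s+ matches any run of whitespace, newlines included.
cleaned = re.sub(r"<p>\s+<p>", "<p>", html)

print(original_call_failed)
print(repr(cleaned))
```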

2 Answers:

Answer 0 (score: 2)

Use the following approach:

import re

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)

print(html)

Output:

<p>Python initially inherited its parsing from C.  While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>

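The replacement r"<\1p>" inserts whatever the first capture group matched. When the second alternative matches a pair of closing tags, the group in <(\/)p> captures the slash and the result is </p>. When the first alternative matches a pair of opening tags, the group did not take part in the match, nothing is inserted, and the result is <p>.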
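
To make the role of the capture group concrete, here is a small sketch that inspects group 1 for each alternative. It assumes Python 3.5 or later, where an unmatched group in a replacement template is substituted with an empty string. The pattern is written with \s alone, since \s already matches \n, and the short sample strings and variable names are made up for illustration.

```python
import re

# Same idea as the pattern in the answer, with [\n\s] shortened to \s.
pattern = re.compile(r"<p>\s+<p>\s*|<(/)p>\s+</p>\s*")

opening = pattern.search("<p>\n<p>\n    text")
closing = pattern.search("text</p>\n</p>")

# Group 1 exists only in the second alternative (closing tags).
opening_group = opening.group(1)   # group did not participate in the match
closing_group = closing.group(1)   # the captured slash

result = pattern.sub(r"<\1p>", "<p>\n<p>\n    text</p>\n</p>")

print(opening_group, closing_group)
print(result)
```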

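Answer 1 (score: 0)

Assuming you also want to collapse the duplicated closing tags, try the following regex:

re.sub(r'(\<\/?p\>)[\s\n]*(\<\/?p\>)', r'\1', html)

Note that this returns a copy of html; it does not modify the original.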
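
A short sketch of how this second approach behaves, under a few assumptions: the sample strings are shortened for illustration, and the `strict` pattern is a variant that is not part of the answer. It shows that re.sub returns a new string, and that the pattern as written also matches a closing tag followed by an opening tag, which would merge two adjacent paragraphs. Requiring the second tag to repeat the first with a backreference is one possible way to avoid that.

```python
import re

nested = "<p>\n<p>\n    text</p>\n</p>"
siblings = "<p>a</p>\n<p>b</p>"

# Pattern from the answer, written as a raw string without extra escapes.
loose = r"(</?p>)\s*(</?p>)"
# Hypothetical stricter variant: the second tag must equal the first.
strict = r"(</?p>)\s*\1"

collapsed = re.sub(loose, r"\1", nested)
unchanged_input = nested                        # strings are immutable; nested is untouched
merged_siblings = re.sub(loose, r"\1", siblings)
kept_siblings = re.sub(strict, r"\1", siblings)
strict_nested = re.sub(strict, r"\1", nested)

print(repr(collapsed))
print(repr(merged_siblings))
print(repr(kept_siblings))
```

Unlike the first answer, neither pattern strips the whitespace after the merged opening tag, so a call to str.strip or an extra \s* may still be needed depending on the desired output.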