Question

I am using Python to parse and clean an HTML document, but the document is badly formatted. For example:

<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>

I want to convert <p>\n<p> into <p>, but I can't seem to match the \n, or any amount of whitespace, between the <p> tags.

What I have tried so far:

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))

However, this fails.

Answer 1

Use the following:

import re

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)

print(html)

Output:

<p>Python initially inherited its parsing from C.  While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>

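The replacement r"<\1p>" inserts the first capture group, that is, the slash captured by <(\/)p>, when the closing-tag alternative matches. When the opening-tag alternative matches, the group is empty, so the result is <p>.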
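To see what the backreference does in each branch, the sketch below runs the opening-tag and closing-tag cases separately. The variable names and the shortened sample strings are illustrative and not part of the answer. It assumes Python 3.5 or later, where a group that did not take part in the match expands to an empty string in the replacement.

```python
import re

# Equivalent to the pattern above: \s already includes \n, and / needs no escape.
pattern = re.compile(r'<p>\s+<p>\s*|<(/)p>\s+</p>\s*')

opening = "<p>\n<p>\n    text"   # shortened sample, opening tags only
closing = "text</p>\n</p>"       # shortened sample, closing tags only

# Opening-tag branch: group 1 did not participate, so \1 expands to "".
opened = pattern.sub(r"<\1p>", opening)

# Closing-tag branch: group 1 captured "/", so the replacement is "</p>".
closed = pattern.sub(r"<\1p>", closing)

print(repr(opened))
print(repr(closed))
```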

Answer 2

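Assuming you also want to collapse the closing tags, try the following regex: re.sub(r'(\<\/?p\>)[\s\n]*(\<\/?p\>)', r'\1', html). Note that this returns a modified copy of html; it does not change the original string.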
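A minimal sketch of this suggestion, run on a shortened version of the sample from the question. The names `cleaned`, `siblings`, and `merged` are illustrative. Because the pattern accepts any mix of opening and closing <p> tags, it would also collapse a `</p>\n<p>` boundary between two sibling paragraphs, which may not be what you want; the last lines show that case.

```python
import re

html = "<p>\n<p>\n    Python initially inherited its parsing from C.</p>\n</p>"

# Same idea as the regex above, written without the redundant escapes.
pattern = r'(</?p>)\s*(</?p>)'

# re.sub returns a new string; `html` itself is left unchanged.
cleaned = re.sub(pattern, r'\1', html)
print(repr(cleaned))
print(html == cleaned)

# Caveat: adjacent sibling paragraphs are matched too, dropping the second <p>.
siblings = "<p>a</p>\n<p>b</p>"
merged = re.sub(pattern, r'\1', siblings)
print(repr(merged))
```

Unlike Answer 1, this pattern does not consume the whitespace that follows the second tag, so the leading newline and indentation of the text remain.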

Selecting <p> tags separated by \n
