I'm using Python to parse and clean an HTML document, but the document is badly formatted. For example:
<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>
I want to convert <p>\n<p> to <p>, but I can't seem to target the \n or any amount of whitespace between the <p> tags.
What I've tried so far:
html = "<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))
However, this fails.
Answer 0 (score: 2)
Use the following approach:
import re
html = "<p>\n<p>\n Python initially inherited its parsing from C. While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)
print(html)
Output:
<p>Python initially inherited its parsing from C. While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>
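The replacement r"<\1p>" inserts the first captured group, the slash /, only when the closing-tag alternative <(\/)p> matched. When the opening-tag alternative matched instead, the group is empty and the replacement is simply <p>. On Python 3.5 and later an unmatched group is substituted as an empty string; older versions raise an error here.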
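As a minimal sketch of how that replacement behaves, the same pattern can be precompiled and wrapped in a function. The names NESTED_P and collapse_nested_p are hypothetical, and the redundant \n is dropped from the character class because \s already matches newlines. Calling .sub on the compiled pattern also avoids the misplaced parenthesis in the question's attempt, where "<p>" and html were passed to re.compile rather than to re.sub.

```python
import re

# Hypothetical name; same alternatives as the pattern in the answer above.
# Group 1 captures "/" only in the closing-tag alternative.
NESTED_P = re.compile(r"<p>\s+<p>\s*|<(/)p>\s+</p>\s*")


def collapse_nested_p(html: str) -> str:
    """Collapse whitespace-separated doubled <p> or </p> tags into one tag."""
    # \1 expands to "/" for a closing-tag match and to "" otherwise
    # (unmatched groups become empty strings on Python 3.5 and later).
    return NESTED_P.sub(r"<\1p>", html)


print(collapse_nested_p("<p>\n<p>\n text</p>\n</p>"))  # <p>text</p>
```

A string that contains no doubled tags, such as "<p>a</p>", passes through unchanged.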
Answer 1 (score: 0)
Assuming you also want to collapse the doubled closing tags, try the following regular expression:
re.sub(r'(</?p>)\s*(</?p>)', r'\1', html)
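Note that re.sub returns a new string and does not modify the original html, so assign the result back if you want to keep it.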
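One caveat is that the pattern above also matches a closing tag followed by an opening tag, so sibling paragraphs separated by whitespace get merged. The sketch below shows that, alongside a variant that uses the backreference \1 so that only a repeated identical tag is collapsed. The variant is an assumption about what is wanted, not part of the original answer, and the sample string and variable names are made up for illustration.

```python
import re

html = "<p>\n<p>one</p>\n</p>\n<p>two</p>\n<p>three</p>"

# Pattern from the answer: any two adjacent <p> or </p> tags are merged,
# including "</p>\n<p>" between sibling paragraphs.
merged = re.sub(r"(</?p>)\s*(</?p>)", r"\1", html)

# Assumed variant: \1 forces the second tag to be identical to the first,
# so only "<p> <p>" and "</p> </p>" pairs are collapsed.
cleaned = re.sub(r"(</?p>)\s*\1", r"\1", html)

print(repr(merged))   # '<p>one</p>\n<p>two</p>three</p>'
print(repr(cleaned))  # '<p>one</p>\n<p>two</p>\n<p>three</p>'

# re.sub returned new strings; the original is unchanged.
print(html.startswith("<p>\n<p>"))  # True
```

For anything beyond this kind of light cleanup, an HTML parser is usually more robust than regular expressions.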