Question

I want to remove HTML comments from HTML text:

<h1>heading</h1> <!-- comment-with-hyphen --> some text <-- con --> more text <hello></hello> more text

This should result in:

<h1>heading</h1> some text <-- con --> more text <hello></hello> more text

Answer 1

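Don't ignore line breaks. Without re.DOTALL the dot does not match a newline, so a comment that spans several lines is missed.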

re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
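A minimal sketch of why the flag matters. The sample string is made up for illustration; it contains one comment that spans two lines.

```python
import re

# Hypothetical sample: one comment spans two lines.
s = "<p>a</p><!-- first line\nsecond line --><p>b</p>"

pattern = "(<!--.*?-->)"

# Without DOTALL, "." stops at the newline, so the comment is left in place.
without_flag = re.sub(pattern, "", s)

# With DOTALL, "." also matches "\n", so the whole comment is removed.
with_flag = re.sub(pattern, "", s, flags=re.DOTALL)

print(without_flag == s)
print(with_flag)
```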

Answer 2

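Finally came up with this option:

re.sub("(<!--.*?-->)", "", t)

Adding the ? makes the search non-greedy, so multiple comment tags are not merged into a single match.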
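To make the greedy versus non-greedy difference concrete, here is a small sketch on a made-up string with two comments. The sample text is an assumption, not taken from the question.

```python
import re

# Hypothetical sample with two separate comments around text that must be kept.
t = "a <!-- one --> keep <!-- two --> b"

# Greedy ".*" runs from the first "<!--" to the last "-->", swallowing " keep ".
greedy = re.sub("<!--.*-->", "", t)

# Non-greedy ".*?" stops at the nearest "-->", so each comment is removed on its own.
lazy = re.sub("<!--.*?-->", "", t)

print(repr(greedy))
print(repr(lazy))
```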

Answer 3

html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)

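re.sub finds every match of the pattern and replaces it with the second argument. In this case the pattern matches anything that starts with <!-- and ends with -->. The dot with *? matches any run of characters non-greedily, and \s and \n cover comments that span multiple lines.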
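As a quick check, the sketch below applies this pattern to the string from the question and to a made-up multi-line comment (the second sample is an assumption).

```python
import re

pattern = r"<!--(.|\s|\n)*?-->"

# Sample string from the question.
html = ("<h1>heading</h1> <!-- comment-with-hyphen --> some text "
        "<-- con --> more text <hello></hello> more text")
cleaned = re.sub(pattern, "", html)

# Hypothetical multi-line comment to exercise the \s and \n alternatives.
multi = "before<!-- line one\nline two -->after"
cleaned_multi = re.sub(pattern, "", multi)

print(cleaned)
print(cleaned_multi)
```

Note that the spaces on either side of the removed comment remain, so the output has two spaces where the comment used to be. The `<-- con -->` fragment is left alone because it does not start with `<!--`.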

Answer 4

re.sub("(?s)<!--.+?-->", "", s)

or

re.sub("<!--.+?-->", "", s, flags=re.DOTALL)
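One thing worth checking with this variant: `.+?` requires at least one character between the delimiters, unlike the `.*?` patterns in the other answers. The sketch below uses a made-up string containing an empty comment to show how the two can differ; whether empty comments occur in your input is an assumption to verify.

```python
import re

# Hypothetical sample: an empty comment followed by text and a normal comment.
s = "a<!---->b<!-- c -->d"

# ".+?" cannot match the empty comment body, so the match stretches on to the
# next "-->", taking the text "b" with it.
plus_result = re.sub("<!--.+?-->", "", s, flags=re.DOTALL)

# ".*?" accepts an empty body, so each comment is removed separately.
star_result = re.sub("<!--.*?-->", "", s, flags=re.DOTALL)

print(plus_result)
print(star_result)
```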

Answer 5

You can try this regex: <![^<]*>
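A short sketch of how this pattern behaves on two made-up inputs (both samples are assumptions). It matches anything that starts with `<!`, so it is broader than comments, and it cannot match a comment whose body contains `<`.

```python
import re

pattern = r"<![^<]*>"

# Hypothetical sample: a doctype followed by markup and a comment.
doc = "<!DOCTYPE html><p>x</p><!-- c -->"
# Both the comment and the doctype are removed, since both start with "<!".
doc_result = re.sub(pattern, "", doc)

# Hypothetical sample: a comment whose body contains "<".
tricky = "<!-- a < b -->"
# "[^<]*" cannot cross the inner "<", so nothing matches and the text is unchanged.
tricky_result = re.sub(pattern, "", tricky)

print(doc_result)
print(tricky_result)
```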

Answer 6

Don't use regex. Use an XML parser; the one in the standard library is more than enough.

from xml.etree import ElementTree as ET

# The default parser discards comments; the input must be well-formed XML.
html = ET.parse("comments.html")
ET.dump(html)  # Dumps to stdout
html.write("no-comments.html", method="html")  # Write to a file
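ElementTree needs well-formed XML, which real-world HTML often is not (the question's sample, for instance, has no single root element). As an alternative sketch, and not this answer's own method, the standard library's html.parser.HTMLParser can re-emit the markup while skipping comments. The names CommentStripper and strip_comments are hypothetical, introduced here for illustration. The re-emission is approximate: end tags come out lowercased, and processing instructions are not handled.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Re-emit the input markup while dropping comments (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps references such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_comment(self, data):
        pass  # Drop the comment entirely.


def strip_comments(markup):
    """Return markup with comments removed (sketch, not exhaustive)."""
    parser = CommentStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = '<h1>heading</h1> <!-- note --> some <b class="x">text</b> &amp; more'
print(strip_comments(sample))
```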

How to remove HTML comments using Regex in Python
