Question

I want to get a page's source as a string, excluding any commented-out code, by using a regular expression. For example:

<html>
<head>
<p>some code</p>
<!--
 <link href='www.xxx.com'>
 -->
<head>
<body>
<p>some more code</p>
</body></html>

Is it possible to get only the uncommented code by using a regular expression?

Answer 1

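You can get the desired output by removing the comment blocks. In the pattern below, `(?s)` makes `.` match newlines so that multi-line comments are matched, and the non-greedy `.*?` stops at the first `-->` instead of running on to the last one.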

re.sub(r'(?s)<!--.*?-->', '', html)

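Example (the leading `\s*` also consumes the whitespace before each comment, so no blank line is left behind):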

>>> import re
>>> html = '''<html>
<head>
<p>some code</p>
<!--
 <link href='www.xxx.com'>
 -->
<head>
<body>
<p>some more code</p>
</body></html>'''
>>> print(re.sub(r'(?s)\s*<!--.*?-->', '', html))
<html>
<head>
<p>some code</p>
<head>
<body>
<p>some more code</p>
</body></html>
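
The same substitution can be wrapped in a small reusable function. This is only a minimal sketch: the names `COMMENT_RE` and `strip_html_comments` are made up for illustration, and `re.DOTALL` is simply the flag form of the inline `(?s)` used above.

```python
import re

# Compiled once and reused; re.DOTALL lets "." match newlines,
# which is what the inline (?s) does in the answer's pattern.
COMMENT_RE = re.compile(r'\s*<!--.*?-->', re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Return html with every <!-- ... --> block (and the whitespace before it) removed."""
    return COMMENT_RE.sub('', html)


# Hypothetical input with one single-line and one multi-line comment.
page = "<p>a</p>\n<!-- one -->\n<p>b</p> <!--\n two\n--><p>c</p>"
print(strip_html_comments(page))
```

Note that a regex treats every `<!--` as the start of a comment, even when it appears inside a `<script>` block or an attribute value. For pages like that, an HTML parser is likely the safer choice.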

Python regex to get HTML code excluding commented-out code in the page source
