Question

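I'm trying to remove specific blocks from an HTML document, in particular JavaScript (<script></script>) and inline CSS (<style></style>). I'm currently trying to use re.sub(), but I'm not having any luck with Multiline. Any tips?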

import re

s = '''<html>
<head>
  <title>Some Template</title>
  <script type="text/javascript" src="{path to Library}/base.js"></script>
  <script type="text/javascript" src="something.js"></script>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <script type="text/javascript">
    // HelloWorld template
    document.write(examples.simple.helloWorld());
  </script>
</body>
</html>'''

print(re.sub('<script.*script>', '', s, count=0, flags=re.M))

Answer 1

Alternatively, since you are parsing and modifying HTML, I'd suggest using an HTML parser such as BeautifulSoup.

If you just want to remove all script tags from the HTML tree, you can use .decompose() or .extract().

.extract() returns the tag that was removed, whereas .decompose() simply destroys it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, "html.parser")
for i in soup('script'):
    i.decompose()

print(soup)
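To make the difference between the two methods concrete, here is a minimal sketch. The sample markup and the variable names (html, removed) are made up for illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = "<div><script>var a = 1;</script><style>p { margin: 0; }</style><p>text</p></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it,
# so it can still be inspected or re-inserted elsewhere.
removed = soup.find("script").extract()
print(removed.name)

# decompose() detaches the tag and destroys it; it returns nothing.
soup.find("style").decompose()

print(soup)
```

Use extract() when you still need the removed tag afterwards; otherwise decompose() is enough.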

As mentioned in the comments, you can make other modifications to the HTML tree as well. See the docs for more information.

Answer 2

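You actually need the DOTALL modifier, not Multiline. Multiline only changes where ^ and $ match; DOTALL is what lets . match newlines.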

print(re.sub(r'(?s)<script\b.*?</script>', '', s))
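A small sketch of why the flag matters, using a made-up sample string (html) rather than the one from the question. The same pattern is run once with re.M and once with re.S:

```python
import re

# Hypothetical sample with a script block that spans several lines.
html = "<p>keep</p>\n<script>\n  var x = 1;\n</script>\n<p>also keep</p>"

# re.M only changes how ^ and $ behave; '.' still stops at newlines,
# so the multi-line script block is never matched.
with_multiline = re.sub(r'<script\b.*?</script>', '', html, flags=re.M)

# re.S (DOTALL) lets '.' match newlines, so the whole block is removed.
with_dotall = re.sub(r'<script\b.*?</script>', '', html, flags=re.S)

print(with_multiline == html)
print(with_dotall)
```

The inline (?s) used above is equivalent to passing flags=re.S.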

The following variant also removes the leading whitespace that precedes each script tag:

print(re.sub(r'(?s)\s*<script\b.*?</script>', '', s))
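Since the question also mentions <style> blocks, the same idea can probably be extended with an alternation and a backreference. This is only a sketch: the sample markup and the names pattern and cleaned are made up here, and a regex like this will not cope with every HTML edge case (for example, a literal "</script>" inside a JavaScript string).

```python
import re

# Hypothetical sample containing both a style block and an upper-case script tag.
html = """<head>
  <style type="text/css">
    body { color: red; }
  </style>
  <SCRIPT src="a.js"></SCRIPT>
</head>
<body>
  <p>text</p>
</body>"""

# (?i) ignores case, (?s) lets '.' match newlines.
# \1 requires the closing tag to match whichever opening tag was found.
pattern = re.compile(r'(?is)\s*<(script|style)\b.*?</\1\s*>')

cleaned = pattern.sub('', html)
print(cleaned)
```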

Removing multi-line HTML in Python
