I'm using the lxml library (http://lxml.de/) to parse HTML documents. So far I've figured out how to strip tags from an HTML document (In lxml, how do I remove a tag but retain all contents?), but the method described in that post leaves all the text behind: stripping the tags does not remove the actual script. I also found a class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is not clear how to actually use that class to clean the document. Any help, perhaps a short example, would be appreciated!
Answer 0 (score: 55)
Below is an example that does what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in this case you want to strip out more than just the <script> tags; you also want to get rid of things like onclick=function() attributes on other tags.
#!/usr/bin/env python
import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # This is True because we want to activate the javascript filter
cleaner.style = True       # This is True because we want to activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
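A smaller variant of the same idea, cleaning an in-memory HTML snippet instead of a fetched page (the markup below is just an invented example, and the commented output is approximate):

from lxml.html.clean import Cleaner

cleaner = Cleaner(javascript=True, style=True)
snippet = '<div><script>alert("hi")</script><style>p {color: red}</style><p onclick="doIt()">Hello</p></div>'
print(cleaner.clean_html(snippet))
# roughly: <div><p>Hello</p></div>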
You can find the list of options you can set in the lxml.html.clean.Cleaner documentation; some options can simply be set to True or False (the default), while others take a list, for example:
cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']
Note the difference between kill and remove:
remove_tags:
A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
A list of tags to include (default include all).
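To make the distinction concrete, here is a minimal sketch (the HTML fragment is invented for illustration, and the commented outputs are approximate) showing how remove_tags keeps the content while kill_tags drops the whole subtree:

from lxml.html.clean import Cleaner

fragment = '<div><h1>Title</h1><p>Hello <a href="#">link</a> world</p></div>'

# remove_tags: the <a> tag disappears but its text stays in the parent <p>
print(Cleaner(remove_tags=['a']).clean_html(fragment))
# roughly: <div><h1>Title</h1><p>Hello link world</p></div>

# kill_tags: the <h1> element is dropped together with its content
print(Cleaner(kill_tags=['h1']).clean_html(fragment))
# roughly: <div><p>Hello <a href="#">link</a> world</p></div>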
Answer 1 (score: 4)
You can use the strip_elements method to remove the scripts, then use strip_tags to remove other tags:
etree.strip_elements(fragment, 'script')
etree.strip_tags(fragment, 'a', 'p') # and other tags that you want to remove
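A self-contained sketch of this approach (the fragment below is a made-up example, and the commented output is approximate):

from lxml import etree, html

fragment = html.fromstring(
    '<div><script>alert("hi")</script><p>Hello <a href="#">link</a></p></div>')

etree.strip_elements(fragment, 'script')  # drops <script> together with its content
etree.strip_tags(fragment, 'a')           # drops the <a> tag but keeps its text

print(html.tostring(fragment, encoding='unicode'))
# roughly: <div><p>Hello link</p></div>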
Answer 2 (score: 2)
You can also use the bs4 library for this purpose.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_src, "lxml")
for tag in soup.find_all(['script', 'style']):
    tag.extract()
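A complete sketch of that approach, assuming bs4 is installed and using an invented html_src string (the commented outputs are approximate):

from bs4 import BeautifulSoup

html_src = '<html><head><script>alert("hi")</script></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(html_src, "lxml")
for tag in soup.find_all(['script', 'style']):
    tag.extract()  # removes the tag together with its content

print(str(soup))        # roughly: <html><head></head><body><p>Hello</p></body></html>
print(soup.get_text())  # only the remaining text: Hello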
Answer 3 (score: 1)
Here are some examples showing how to remove different kinds of HTML elements from an XML/HTML tree.
The key advantage: this does not depend on any external library and does everything in native Python 2/3 code.
Here are a few examples of how to do this with "native" Python...
import re

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match <script ...> ... </script>, lazily
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match <style ...> ... </style>, lazily
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META> to </meta> and variations)
pattern = r'<[ ]*meta.*?>'  # match <meta ...> up to the first closing >
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # match <!-- ... -->, lazily
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # match <!DOCTYPE ...> up to the first closing >
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
Notes:
re.IGNORECASE  # needed so <script>, <SCRIPT> and <Script> all match, regardless of case
re.MULTILINE   # makes ^ and $ match at line boundaries (not strictly required by these patterns)
re.DOTALL      # makes . also match newlines, so tags that span multiple lines are removed
I have tested this on several different HTML files, and it runs quickly and works across newlines.
Note: it does not depend on BeautifulSoup or any other externally downloaded library!
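For illustration, a quick sketch of these substitutions applied to a small, made-up HTML string (the commented output is approximate):

import re

sample = """<html><head>
<script type="text/javascript">alert('hi');</script>
<style>p { color: red; }</style>
</head><body><p>Hello world</p></body></html>"""

flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
for pattern in (r'<[ ]*script.*?\/[ ]*script[ ]*>',
                r'<[ ]*style.*?\/[ ]*style[ ]*>'):
    sample = re.sub(pattern, '', sample, flags=flags)

print(sample)
# roughly: <html><head>\n\n\n</head><body><p>Hello world</p></body></html>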
Hope this helps!
:)