Remove all javascript tags and style tags from html using python and the lxml module

Asked: 2011-12-18 19:01:20

Tags: python html lxml

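I am parsing an HTML document with the http://lxml.de/ library. So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post keeps all the text: it strips the tags without removing the actual script. I have also found a class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is not at all clear how to actually use the class to clean a document. Any help, perhaps a short example, would be appreciated!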

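4 answers:

Answer 0: (score: 55)

Below is an example that does what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in a case like this you want to remove more than just the <script> tags; you also want to get rid of things like onclick=function() attributes on other tags.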

#!/usr/bin/env python

import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # True activates the javascript filter
cleaner.style = True       # True activates the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
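
The same idea can be tried without a network connection. The following is a minimal sketch, assuming a made-up inline HTML string (`sample`) in place of a downloaded page; it is meant to show that the cleaner drops the <script> and <style> elements and also the onclick attribute. The try/except import is an assumption about your environment: on lxml 5.2 and later the cleaner is distributed as the separate lxml_html_clean package, while older releases ship it as lxml.html.clean.

```python
# On lxml 5.2+ the cleaner lives in the separate lxml_html_clean package;
# older releases provide it as lxml.html.clean.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

# Made-up sample markup for this sketch (not from the original post).
sample = (
    '<div>'
    '<script>alert("hi")</script>'
    '<style>p { color: red; }</style>'
    '<p onclick="doEvil()">Hello <b>world</b></p>'
    '</div>'
)

# Same two switches as in the answer, passed as keyword arguments.
cleaner = Cleaner(javascript=True, style=True)

# clean_html accepts a string and returns the cleaned markup as a string.
cleaned = cleaner.clean_html(sample)
print(cleaned)
```

The printed markup should still contain the paragraph text, but no script body, no style rules and no onclick attribute.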

You can find the list of options you can set in the lxml.html.clean.Cleaner documentation; some options are simply set to True or False, while others take a list, for example:

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between kill and remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
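
The kill/remove distinction above can be sketched on a small, made-up fragment. This is only an illustration under stated assumptions: the `sample` string is hypothetical, and the import fallback assumes the cleaner may live in either lxml_html_clean (lxml 5.2+) or lxml.html.clean (older releases).

```python
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

# Hypothetical fragment: one heading to kill, one <b> tag to remove.
sample = '<div><h1>Title</h1><p>Hello <b>world</b></p></div>'

cleaner = Cleaner(
    kill_tags=['h1'],   # drop <h1> and everything inside it
    remove_tags=['b'],  # drop the <b> tag itself, keep its text
)

cleaned = cleaner.clean_html(sample)
print(cleaned)
```

After cleaning, the heading text "Title" is gone entirely, whereas "world" survives as plain text inside the paragraph.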

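Answer 1: (score: 4)

You can use the strip_elements method to remove the scripts, then use the strip_tags method to remove other tags:

from lxml import etree

# fragment is an already parsed lxml element or tree
etree.strip_elements(fragment, 'script')
etree.strip_tags(fragment, 'a', 'p') # and other tags that you want to remove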
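
A self-contained version of these two calls might look like the sketch below. The input string is made up for illustration. One detail worth knowing: strip_elements has a with_tail parameter that defaults to True, which also deletes any text directly following the removed element; the sketch passes with_tail=False to keep such text.

```python
import lxml.html
from lxml import etree

# Made-up fragment for this sketch.
fragment = lxml.html.fromstring(
    '<div><script>alert(1)</script><p>Hello <a href="#">link</a> text</p></div>'
)

# Remove <script> together with its content; with_tail=False keeps any
# text that directly follows the element.
etree.strip_elements(fragment, 'script', with_tail=False)

# Remove the <a> and <p> tags but merge their text into the parent.
etree.strip_tags(fragment, 'a', 'p')

result = lxml.html.tostring(fragment, encoding='unicode')
print(result)
```

The script body disappears, while "Hello link text" remains as plain text inside the <div>.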

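Answer 2: (score: 2)

You can also use the bs4 library for this purpose.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_src, "lxml")
for x in soup.find_all(['script', 'style']):
    x.extract()  # detach the element (and its content) from the tree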

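Answer 3: (score: 1)

Here are some examples of how to remove and parse different types of HTML elements from an XML/HTML tree.

Key suggestion: it helps NOT to depend on external libraries and to do everything in "native Python 2/3 code".

Here are some examples of how to do this with "native" Python...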

import re

# text holds the HTML source as a string

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META> to </meta> and variations)
pattern = r'<[ ]*meta.*?>'  # match any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # match any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # match any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

Notes:

re.IGNORECASE # makes matching case-insensitive, so <script>, <SCRIPT> and <Script> are all matched
re.MULTILINE # only changes how ^ and $ behave; these patterns use neither, so it has no effect here
re.DOTALL # makes "." match newlines as well, so a block spanning several lines is matched
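
To see the effect of the flags, the patterns can be collected into one helper and run on a small input. This is a sketch only: the names `PATTERNS`, `FLAGS`, `strip_markup` and the `sample` string are made up here, and re.MULTILINE is left out because the patterns contain no ^ or $.

```python
import re

# IGNORECASE for <SCRIPT>/<script>, DOTALL so "." also crosses newlines.
FLAGS = re.IGNORECASE | re.DOTALL

# Patterns copied from the answer; the list name is hypothetical.
PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',  # <script> ... </script>
    r'<[ ]*style.*?\/[ ]*style[ ]*>',    # <style> ... </style>
    r'<[ ]*meta.*?>',                    # <meta ...>
    r'<[ ]*!--.*?--[ ]*>',               # <!-- ... -->
    r'<[ ]*\![ ]*DOCTYPE.*?>',           # <!DOCTYPE ...>
]


def strip_markup(text):
    """Apply each pattern in turn and return what is left."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text, flags=FLAGS)
    return text


# Made-up input with an upper-case, multi-line script block.
sample = (
    '<!DOCTYPE html><p>Hello</p>'
    '<SCRIPT>\nalert(1)\n</SCRIPT>'
    '<style>\np {color: red}\n</style>'
    '<!-- note --><p>World</p>'
)

cleaned = strip_markup(sample)
print(cleaned)  # <p>Hello</p><p>World</p>

# Without DOTALL, "." stops at the newline and the script block survives.
without_dotall = re.sub(PATTERNS[0], '', sample, flags=re.IGNORECASE)
print('alert' in without_dotall)  # True
```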

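I have tested this on several different HTML files; it runs "quickly" and works across newlines!

Note: it also does not depend on beautifulsoup or any other externally downloaded library!

Hope this helps!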

:)