I'm using the lxml library (http://lxml.de/) to parse HTML documents. So far I've figured out how to strip tags from an HTML document (In lxml, how do I remove a tag but retain all contents?), but the method described in that post leaves all the text behind: stripping the tags does not remove the actual script. I also found a class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is not clear how to actually use that class to clean the document. Any help, perhaps a short example, would be appreciated!
Answer 0 (score: 55)
Below is an example that does what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in this case you want to strip out more than just the <script> tags; you also want to get rid of things like onclick=function() attributes on other tags.
#!/usr/bin/env python
import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # This is True because we want to activate the javascript filter
cleaner.style = True       # This is True because we want to activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
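A smaller variant of the same idea, cleaning an in-memory HTML snippet instead of a fetched page (the markup below is just an invented example, and the commented output is approximate):

from lxml.html.clean import Cleaner

cleaner = Cleaner(javascript=True, style=True)
snippet = '<div><script>alert("hi")</script><style>p {color: red}</style><p onclick="doIt()">Hello</p></div>'
print(cleaner.clean_html(snippet))
# roughly: <div><p>Hello</p></div>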
You can find the list of options you can set in the lxml.html.clean.Cleaner documentation; some options can simply be set to True or False (the default), while others take a list, for example:
cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']
Note the difference between kill and remove:
remove_tags:
A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
A list of tags to include (default include all).
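To make the distinction concrete, here is a minimal sketch (the HTML fragment is invented for illustration, and the commented outputs are approximate) showing how remove_tags keeps the content while kill_tags drops the whole subtree:

from lxml.html.clean import Cleaner

fragment = '<div><h1>Title</h1><p>Hello <a href="#">link</a> world</p></div>'

# remove_tags: the <a> tag disappears but its text stays in the parent <p>
print(Cleaner(remove_tags=['a']).clean_html(fragment))
# roughly: <div><h1>Title</h1><p>Hello link world</p></div>

# kill_tags: the <h1> element is dropped together with its content
print(Cleaner(kill_tags=['h1']).clean_html(fragment))
# roughly: <div><p>Hello <a href="#">link</a> world</p></div>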
Answer 1 (score: 4)
You can use the strip_elements method to remove the scripts, then use strip_tags to remove other tags:
etree.strip_elements(fragment, 'script')
etree.strip_tags(fragment, 'a', 'p') # and other tags that you want to remove
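A self-contained sketch of this approach (the fragment below is a made-up example, and the commented output is approximate):

from lxml import etree, html

fragment = html.fromstring(
    '<div><script>alert("hi")</script><p>Hello <a href="#">link</a></p></div>')

etree.strip_elements(fragment, 'script')  # drops <script> together with its content
etree.strip_tags(fragment, 'a')           # drops the <a> tag but keeps its text

print(html.tostring(fragment, encoding='unicode'))
# roughly: <div><p>Hello link</p></div>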
Answer 2 (score: 2)
You can also use the bs4 library for this purpose.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_src, "lxml")
for tag in soup.find_all(['script', 'style']):
    tag.extract()
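A complete sketch of that approach, assuming bs4 is installed and using an invented html_src string (the commented outputs are approximate):

from bs4 import BeautifulSoup

html_src = '<html><head><script>alert("hi")</script></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(html_src, "lxml")
for tag in soup.find_all(['script', 'style']):
    tag.extract()  # removes the tag together with its content

print(str(soup))        # roughly: <html><head></head><body><p>Hello</p></body></html>
print(soup.get_text())  # only the remaining text: Hello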
Answer 3 (score: 1)
Here are some examples showing how to remove different kinds of HTML elements from an XML/HTML tree.
The key advantage: this does not depend on any external library and does everything in native Python 2/3 code.
Here are a few examples of how to do this with "native" Python...
import re

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match <script ...> ... </script>, lazily
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match <style ...> ... </style>, lazily
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META> to </meta> and variations)
pattern = r'<[ ]*meta.*?>'  # match <meta ...> up to the first closing >
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # match <!-- ... -->, lazily
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # match <!DOCTYPE ...> up to the first closing >
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
Notes:
re.IGNORECASE  # needed so <script>, <SCRIPT> and <Script> all match, regardless of case
re.MULTILINE   # makes ^ and $ match at line boundaries (not strictly required by these patterns)
re.DOTALL      # makes . also match newlines, so tags that span multiple lines are removed
I have tested this on several different HTML files, and it runs quickly and works across newlines.
Note: it does not depend on BeautifulSoup or any other externally downloaded library!
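For illustration, a quick sketch of these substitutions applied to a small, made-up HTML string (the commented output is approximate):

import re

sample = """<html><head>
<script type="text/javascript">alert('hi');</script>
<style>p { color: red; }</style>
</head><body><p>Hello world</p></body></html>"""

flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
for pattern in (r'<[ ]*script.*?\/[ ]*script[ ]*>',
                r'<[ ]*style.*?\/[ ]*style[ ]*>'):
    sample = re.sub(pattern, '', sample, flags=flags)

print(sample)
# roughly: <html><head>\n\n\n</head><body><p>Hello world</p></body></html>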
Hope this helps!
:)