I'm having trouble stripping the JavaScript out of a page with lxml in Python. When I run the code below, my output is:
<Element div at 0x10cec4e10>
import urllib2
import lxml.html
from lxml.html.clean import Cleaner
cleaner = Cleaner()
cleaner.javascript = True
text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test
What I want is the parsed text without the JS stuff. Can someone explain? Thanks.
Answer 0 (score: 1)
import lxml.etree
import lxml.html
import lxml.html.clean
import urllib2
URL = "http://www.google.com/"
ENCODING = "latin1"
args = {
"javascript": True, # strip javascript
"page_structure": False, # leave page structure alone
"style": True # remove CSS styling
}
cleaner = lxml.html.clean.Cleaner(**args)
# get the page source
html = urllib2.urlopen(URL).read().decode(ENCODING)
# clean it up
clean = cleaner.clean_html(html)
# print unformatted html dump
print(clean)
# print properly indented html
tree = lxml.html.fromstring(clean)
print(lxml.etree.tostring(tree, pretty_print=True))
Note that pretty printing works properly with lxml.etree.tostring(), but lxml.html.tostring() does a poor job of it: it inserts line breaks but no indentation. Go figure.
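The question's output, <Element div at 0x10cec4e10>, is simply the repr of the parsed element: printing an lxml element does not print its markup or its text. If the goal is the visible text rather than cleaned HTML, one possible approach is sketched below. To keep the sketch self-contained it does not use Cleaner; it swaps in lxml.etree.strip_elements() to drop script and style elements, then calls text_content(). The sample markup and the helper name visible_text are assumptions made up for illustration, not part of the original question or answer.

```python
import lxml.etree
import lxml.html

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <div>
      <p>Hello, world.</p>
      <script>alert("hi");</script>
    </div>
  </body>
</html>
"""


def visible_text(html):
    """Return the text of an HTML string with script/style content dropped."""
    tree = lxml.html.fromstring(html)
    # Remove <script> and <style> elements together with their contents;
    # with_tail=False keeps any text that follows the removed element.
    lxml.etree.strip_elements(tree, "script", "style", with_tail=False)
    # text_content() concatenates every text node below the element.
    text = tree.text_content()
    # Collapse the whitespace runs left over from the markup.
    return " ".join(text.split())


if __name__ == "__main__":
    print(visible_text(SAMPLE_HTML))
```

The same text_content() call should also work on the tree built from cleaner.clean_html(html) in the answer above, if you prefer to keep Cleaner for the stripping step.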