I want to use Scrapy and Splash to crawl a web page that contains JavaScript.
The page contains <script type = text/javascript> JS_FUNCTIONS(generate html content) </script>,
so I am trying to get the HTML file after JS_FUNCTIONS has run, as shown below.
import scrapy
from scrapy_splash import SplashRequest


class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ["http://foo.com"]

    def start_requests(self):
        for url in self.start_urls:
            # Ask Splash to render the page, waiting 0.5 s for scripts to run
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # Save the rendered page under a name derived from the URL
        file_name = response.url.split("//")[-1]
        with open(file_name, 'wb') as f:
            f.write(response.body)
When I run the command scrapy crawl foo, the saved HTML file still contains <script type = text/javascript> JS_FUNCTIONS(generate html content) </script> and does not contain the HTML content that should be generated by JS_FUNCTIONS.
How can I get an HTML file that includes the JavaScript-generated content?
Thanks.
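(Note: a SplashRequest is only routed through Splash if scrapy-splash is enabled in the project settings. A minimal sketch of that wiring, assuming a local Splash instance on the default port 8050:)

# settings.py -- minimal scrapy-splash wiring (sketch)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'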
Answer 0 (score: 0)
Maybe try running the request through the execute endpoint with the following Lua code:
lua_code = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
    }
end
"""

SplashRequest(url, self.parse, args={'lua_source': lua_code}, endpoint='execute')
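Putting it together, here is a minimal sketch of how this Lua script could be wired into the spider from the question (the spider name, URL, and file-writing step are carried over for illustration; with the 'execute' endpoint and a returned html field, scrapy-splash by default fills response.body with the rendered HTML):

import scrapy
from scrapy_splash import SplashRequest

lua_code = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
    }
end
"""


class FooSpider(scrapy.Spider):
    name = 'foo'
    start_urls = ["http://foo.com"]

    def start_requests(self):
        for url in self.start_urls:
            # Run the Lua script on Splash's 'execute' endpoint
            yield SplashRequest(url, self.parse,
                                endpoint='execute',
                                args={'lua_source': lua_code})

    def parse(self, response):
        # The script returns {html = splash:html()}, so response.body holds
        # the HTML as it looks after the JavaScript has run; save it as before.
        file_name = response.url.split("//")[-1]
        with open(file_name, 'wb') as f:
            f.write(response.body)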