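When I try to scrape 'https://book.douban.com/annual/2016', I can't get the data. My code is at the bottom.
The response shows "Hello IT, have you tried turning it off and on again?".
But when I run the same script at "http://localhost:8050", it shows me exactly what I want.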
import scrapy
from scrapy_splash import SplashRequest

# Lua script for Splash's 'execute' endpoint: preload the page's JS files,
# load the URL, wait briefly, then return the rendered HTML.
lua_script = """
function main(splash)
    assert(splash:autoload("https://img3.doubanio.com/f/ithil/31683c94fc5c3d40cb6e3d541825be4956a1220d/js/lib/es5-shim.min.js"))
    assert(splash:autoload("https://img3.doubanio.com/f/ithil/a7de8db438da176dd0eeb59efe46306b39f1261f/js/lib/es6-shim.min.js"))
    assert(splash:autoload("https://img3.doubanio.com/dae/cdnlib/libs/jweixin/1.0.0/jweixin.js"))
    assert(splash:autoload("https://img3.doubanio.com/f/ithil/dd4fe4440669275cafde939df8cfdd32ca1252e5/gen/ithil.bundle.js"))
    assert(splash:autoload("https://hm.baidu.com/hm.js?16a14f3002af32bf3a75dfe352478639"))
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""


class DoubanbookSpider(scrapy.Spider):
    name = 'doubanBook-2016'
    allowed_domains = ['book.douban.com']
    start_urls = ['http://book.douban.com/']

    def start_requests(self):
        base_url = 'https://book.douban.com/annual/2016'
        # yield SplashRequest(base_url)
        yield SplashRequest(base_url, endpoint='execute',
                            args={'lua_source': lua_script},
                            cache_args=['lua_source'])

    def parse(self, response):
        # The body contains "Hello IT, have you tried turning it off and on again?"
        print(response.body)
        listname = response.css('h1 div::text').extract_first()
        print(listname)  # prints None
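
To make the failure easier to spot than reading the printed body, a small check like the one below could be called from parse(). This is only a sketch: the helper name is_placeholder_page is hypothetical, and it assumes the fallback message always contains the exact English phrase quoted above.

```python
# Hypothetical helper: detects the fallback message quoted in this post.
# Assumption: the unrendered page always contains this exact phrase.
PLACEHOLDER = b"have you tried turning it off and on again?"


def is_placeholder_page(body: bytes) -> bool:
    """Return True if the response body still contains the fallback message."""
    # Compare in lowercase so differences in capitalization do not matter.
    return PLACEHOLDER in body.lower()
```

In parse(), calling is_placeholder_page(response.body) would then tell me whether Splash returned the page before its JavaScript had filled in the content.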