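I have some code that is meant to detect redirects while scraping and then return a request for the page it was redirected to, so that I can parse the redirected page. However, I have let the scraper run for a long time and it has gotten nowhere with parsing the redirected pages. Keep in mind that my start_urls list is generated dynamically and can sometimes be very long. What I want to accomplish is for the scraper to
Scraping the original page shouldn't be too complicated; I've already handled that part of the code in the else statement. My main problem is the first three tasks. I want to move redirect parsing to the top of the scraper's priority list. How can I do this?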
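    import scrapy
    from urllib.parse import urljoin

    # method of the spider class; the spider must let 302 responses reach
    # the callback (for example via handle_httpstatus_list = [302])
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status == 302:
            self.logger.info("Response is 302")
            # on Python 3 the decoded header is already a str,
            # so the to_native_str wrapper is not needed
            location = response.headers['location'].decode('latin1')
            self.logger.info("Location: %s" % location)
            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)
            self.logger.info("Redirected_url: %s" % redirected_url)
            self.logger.info("Yielding redirect instead")
            # yield rather than return, so the request is still emitted
            # if the else branch also yields items
            yield scrapy.Request(redirected_url, callback=self.parse, meta={'dont_redirect': True})
        else:
            # parse the redirected page
            pass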
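
The URL-resolution step in the snippet above can be checked on its own, without running a crawl. This is only a minimal sketch: `resolve_redirect` is a hypothetical helper (not part of Scrapy), and the example URLs are placeholders. It assumes the `Location` header arrives as raw bytes, as it does in `response.headers`.

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location_header):
    """Hypothetical helper: return the absolute URL a redirect points to."""
    # Header values arrive as bytes; latin-1 decoding never raises
    location = location_header.decode("latin1")
    # urljoin resolves a relative Location against the original request URL
    # and leaves an absolute Location unchanged
    return urljoin(request_url, location)


if __name__ == "__main__":
    print(resolve_redirect("http://example.com/a/b", b"/login"))
    print(resolve_redirect("http://example.com/a/b", b"http://other.example/x"))
```

In the spider, the returned URL would be passed to `scrapy.Request`. `scrapy.Request` also accepts a `priority` argument (an integer, default 0, where higher values are scheduled earlier), which may be one way to push the redirected request ahead of the remaining start_urls.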