Question

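I have some code that is meant to detect a redirect while scraping and then return a request for the page it was redirected to, so that I can parse the redirected page. However, I have let the scraper run for a long time and it has never parsed a redirected page. Keep in mind that my start_urls list is generated dynamically and can sometimes be very long. What I want the scraper to do is: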

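Detect the redirect
Stop crawling start_urls or anything else in the queue
Parse the redirect
Crawl the original page it was redirected from, and continue crawling start_urls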

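Crawling the original page should not be too complicated; I already handle that part in the else branch. My main problem is the first three tasks. I want to move parsing of the redirect to the top of the scraper's priority list. How can I do that?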

import scrapy
from urllib.parse import urljoin

def parse(self, response):
    self.logger.info("got response %d for %r" % (response.status, response.url))

    # handle redirection
    # this is copied/adapted from RedirectMiddleware
    if response.status == 302:

        self.logger.info("Response is 302")

        # header values are bytes; once decoded, to_native_str is redundant on Python 3
        location = response.headers['location'].decode('latin1')

        self.logger.info("Location: %s" % location)

        # get the original request
        request = response.request
        # and the URL we got redirected to
        redirected_url = urljoin(request.url, location)

        self.logger.info("Redirected_url: %s" % redirected_url)

        self.logger.info("Yielding redirect instead")
        return scrapy.Request(redirected_url, callback=self.parse, meta={'dont_redirect': True})
    else:
        # parse the redirected page
        pass  # parsing code omitted in the original post
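
One possible way to approach the first three tasks is to rely on the `priority` argument of `scrapy.Request`: requests with a higher priority value are taken from the scheduler queue earlier. The sketch below is not a confirmed fix for the code above. It assumes the spider sets `handle_httpstatus_list = [302]` so that 302 responses reach `parse` at all, and the names `redirect_request_kwargs`, `PRIORITY_BOOST`, `MAX_HOPS`, and the `redirect_hops` meta key are hypothetical. The helper only reads attributes that Scrapy's `Response` and `Request` expose (`status`, `headers`, `request`, `url`, `priority`, `meta`) and returns keyword arguments for `scrapy.Request`, so it can be tried without a running crawl.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}
PRIORITY_BOOST = 100  # hypothetical value; any positive number raises priority
MAX_HOPS = 5          # hypothetical cap to avoid endless redirect loops


def redirect_request_kwargs(response):
    """Return kwargs for follow-up requests, or [] if nothing should be queued.

    The first entry targets the redirect destination, the second re-queues
    the original URL just behind it.
    """
    location = response.headers.get("Location")  # bytes or None
    if response.status not in REDIRECT_STATUSES or location is None:
        return []

    original = response.request
    hops = original.meta.get("redirect_hops", 0)
    if hops >= MAX_HOPS:
        return []

    target = urljoin(original.url, location.decode("latin1"))
    boosted = original.priority + PRIORITY_BOOST
    meta = {"dont_redirect": True, "redirect_hops": hops + 1}

    return [
        # Higher priority: scheduled ahead of the remaining start_urls.
        dict(url=target, priority=boosted, dont_filter=True, meta=dict(meta)),
        # dont_filter=True because the dupefilter has already seen this URL.
        dict(url=original.url, priority=boosted - 1, dont_filter=True,
             meta=dict(meta)),
    ]


# Assumed usage inside the spider's parse method:
#
#     def parse(self, response):
#         follow_ups = redirect_request_kwargs(response)
#         if follow_ups:
#             for kwargs in follow_ups:
#                 yield scrapy.Request(callback=self.parse, **kwargs)
#             return
#         ...  # normal parsing of a non-redirect response
```

Note that priority only reorders requests still waiting in the scheduler. Requests already being downloaded are not cancelled, so this moves the redirect to the front of the queue rather than strictly pausing everything else.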

Scrapy: parse the redirect first

0 answers: