How do I get the redirected URL in process_request() of scrapy's RedirectMiddleware?

Asked: 2017-11-22 06:57:51

Tags: scrapy

For example, the URL http://www.wandoujia.com/search?key=saber is redirected to the new URL http://www.wandoujia.com/search/3161097853842468421. I want to get that new URL in process_request() of scrapy's RedirectMiddleware.

Here is my code:

import logging
import re

from scrapy import Request


class RedirectMiddleware(object):
    def process_request(self, request, spider):
        # At this point request.url is still the URL before any redirect.
        new_url = request.url
        logging.debug('new_url = ' + new_url)
        logging.debug('****************************')
        patterns = spider.request_pattern
        logging.debug(patterns)
        for pattern in patterns:
            obj = re.match(pattern, new_url)
            if obj:
                return Request(new_url)

P.S. request.url is the old URL. I want to get the new URL correctly.

1 Answer:

Answer 0: (score: 0)

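Try replacing the default middleware with something like the following. The method you are looking for is process_response, because it is the response that "contains the redirect".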

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class CustomRedirectMiddleware(RedirectMiddleware):
    def process_response(self, request, response, spider):
        # The parent returns a new Request when it follows a redirect,
        # otherwise it returns a response.
        redirected = super(CustomRedirectMiddleware, self).process_response(
            request, response, spider)
        if isinstance(redirected, request.__class__):
            print("Original url: <{}>".format(request.url))
            print("Redirected url: <{}>".format(redirected.url))
            return redirected
        return response
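
For the subclass to take effect, the built-in middleware usually has to be disabled and the custom one registered in its place in the project's settings.py. The following is a minimal sketch, assuming the class is saved in a hypothetical module named myproject.middlewares (that path is not from the question); 600 is the priority Scrapy gives RedirectMiddleware in its default settings, so reusing it should keep the same position in the middleware chain.

```python
# settings.py (sketch; "myproject.middlewares" is an assumed module path)
DOWNLOADER_MIDDLEWARES = {
    # Setting a middleware to None disables it.
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    # Register the subclass at the priority the built-in one normally uses.
    "myproject.middlewares.CustomRedirectMiddleware": 600,
}
```

With this in place, each followed redirect should print both the original and the redirected URL from process_response.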