For example, the URL http://www.wandoujia.com/search?key=saber gets redirected to the new URL http://www.wandoujia.com/search/3161097853842468421. How can I get the new URL inside process_request() of Scrapy's RedirectMiddleware?
Here is my code:
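    import logging
    import re

    from scrapy import Request


    class RedirectMiddleware(object):
        def process_request(self, request, spider):
            # At this point request.url is still the URL that was requested,
            # not the redirect target.
            new_url = request.url
            logging.debug('new_url = ' + new_url)
            logging.debug('****************************')
            # request_pattern is an attribute defined on the spider.
            patterns = spider.request_pattern
            logging.debug(patterns)
            for pattern in patterns:
                obj = re.match(pattern, new_url)
                if obj:
                    return Request(new_url)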
PS: request.url is the old URL. I want to get the new URL correctly.
Answer 0 (score: 0)
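Try replacing the default middleware with something like the following. The method you are looking for is process_response, not process_request, because it is the response that "contains the redirect": the new URL is only known once the server has answered the original request.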
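    from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


    class CustomRedirectMiddleware(RedirectMiddleware):
        def process_response(self, request, response, spider):
            # Let the built-in middleware handle the response first. For a
            # redirect it returns a new Request aimed at the redirect target.
            redirected = super(CustomRedirectMiddleware, self).process_response(
                request, response, spider)
            if isinstance(redirected, request.__class__):
                print("Original url: <{}>".format(request.url))
                print("Redirected url: <{}>".format(redirected.url))
                return redirected
            # Not a redirect: pass the response through unchanged.
            return response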
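
For the replacement to take effect, the built-in redirect middleware has to be switched off and the custom one registered in the project's settings.py. A minimal sketch, assuming the class above lives in a module called myproject.middlewares (that path is hypothetical, adjust it to your own project):

```python
# settings.py (sketch; "myproject.middlewares" is an assumed module path)
DOWNLOADER_MIDDLEWARES = {
    # Setting a middleware to None disables the built-in one.
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    # 600 is the slot the built-in RedirectMiddleware normally occupies.
    "myproject.middlewares.CustomRedirectMiddleware": 600,
}
```

Keeping the same order value should leave the custom class in the same position in the middleware chain as the one it replaces.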