I wrote a custom retry middleware that swaps the headers and rotates the proxy IP for every failed request.
Although the retry middleware handles unwanted Cloudflare pages and bad HTTP status codes just fine, on a timeout it does not rotate the IP or headers.
Any idea why?
import random

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.http import Headers
from scrapy.utils.project import get_project_settings
from scrapy.utils.response import response_status_message

# headers_generator is a project helper of mine (not shown here)


class CustomRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        self.headers = headers_generator(headers=True)
        settings = get_project_settings()
        self.proxies = settings.get('proxies')
        print(f"Proxies are: {self.proxies}")

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        http_codes = [403, 404, 405, *self.retry_http_codes]
        request.headers = Headers(self.headers.generate())
        request.meta['proxy'] = f"http://{random.choice(self.proxies)}"
        print(request.meta)
        print(request.headers)
        if response.status in http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        if b'Pardon' in response.body or b'403 ERROR' in response.body:
            print(f'Detected custom Cloudflare ban for: {response.url}')
            return self._retry(request, "Cloudflare detection",
                               spider) or response
        return response
Some logs are attached below:
2020-12-30 02:06:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-30 02:06:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
{'download_timeout': 15.0, 'download_slot': 'www.example-website.gr', 'download_latency': 0.12942719459533691, 'proxy': 'http://139.162.60.99:8080'}
{b'Accept': [b'*/*'], b'Connection': [b'keep-alive'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 OPR/52.0.2871.40'], b'Accept-Encoding': [b'gzip, deflate, br'], b'Dnt': [b'1'], b'Referer': [b'https://google.com']}
2020-12-30 02:06:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 1 times): 403 Forbidden
{'download_timeout': 15.0, 'download_slot': 'www.example-website.gr', 'download_latency': 3.5957388877868652, 'proxy': 'http://103.211.10.14:52616', 'retry_times': 1}
{b'Accept': [b'*/*'], b'Connection': [b'keep-alive'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0.3) Gecko/20100101 Firefox/63.0.3'], b'Dnt': [b'1'], b'Referer': [b'https://google.com']}
2020-12-30 02:06:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 2 times): 405 Method Not Allowed
2020-12-30 02:06:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 3 times): User timeout caused connection failure.
2020-12-30 02:06:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 4 times): User timeout caused connection failure: Getting https://www.example-website.gr/sitemap took longer than 15.0 seconds..
2020-12-30 02:07:07 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 5 times): User timeout caused connection failure: Getting https://www.example-website.gr/sitemap took longer than 15.0 seconds..
2020-12-30 02:07:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
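My current guess is that a timeout never produces a response at all, so my `process_response` override is simply never called for it: the timeout surfaces as an exception, which only the inherited `RetryMiddleware.process_exception` handles, and that path knows nothing about my rotation code. Here is a minimal stand-alone sketch of that control flow, using plain-Python stand-ins (these are NOT Scrapy's real classes or method signatures, just an illustration of where the rotation has to live):

```python
import random

# Simplified model of the two downloader-middleware paths:
# - a completed download (even a 403/405 page) yields a response object
#   and flows through process_response()
# - a timeout raises an exception and flows through process_exception()

class RotatingRetryMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    def _rotate(self, request):
        # Shared rotation hook so both paths pick a fresh proxy.
        request['proxy'] = f"http://{random.choice(self.proxies)}"

    def process_response(self, request, response):
        self._rotate(request)   # runs for retryable HTTP status codes
        return response

    def process_exception(self, request, exception):
        self._rotate(request)   # must ALSO run here, or timeouts reuse the old proxy
        return request          # handing back the request stands in for a retry


mw = RotatingRetryMiddleware(['1.2.3.4:8080', '5.6.7.8:3128'])
request = {}
mw.process_exception(request, TimeoutError('took longer than 15.0 seconds'))
print('proxy' in request)   # True -- rotation now happens on timeouts too
```

If this reading is right, overriding `process_exception` in my middleware (mirroring the header/proxy rotation that `process_response` already does) should be the fix.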