Scrapy user timeout and proxy rotation issue

Date: 2020-12-30 00:43:21

Tags: python web-scraping scrapy scrapy-middleware

I wrote a custom retry middleware that modifies the headers and rotates the proxy IP for every failed request.

The retry middleware handles unwanted Cloudflare pages and bad HTTP status codes fine, but on timeouts it does not rotate the IP or the headers correctly.

Any idea why?

import random

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.http.headers import Headers
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        # headers_generator is the project's own header factory
        self.headers = headers_generator(headers=True)
        # `settings` already holds the project settings, so the extra
        # get_project_settings() call was redundant
        self.proxies = settings.get('proxies')
        print(f"Proxies are: {self.proxies}")

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response

        http_codes = [403, 405, 404, *self.retry_http_codes]

        request.headers = Headers(self.headers.generate())

        request.meta['proxy'] = f"http://{random.choice(self.proxies)}"

        print(request.meta)
        print(request.headers)

        if response.status in http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        if (b'Pardon' in response.body or b'403 ERROR' in response.body):
            print(f'Detected custom ban from Cloudflare for: {response.url}')
            return self._retry(request, "Cloudflare Detection",
                               spider) or response

        return response
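The likely cause: Scrapy routes download *errors* (including the "User timeout caused connection failure" seen in the logs below) to a middleware's process_exception() hook, not to process_response(), so rotation code that lives only in process_response() never runs for timeouts. A hedged sketch of the missing hook follows; it is written as a plain class so it runs without Scrapy installed, and in the real middleware the same method would sit on CustomRetryMiddleware and end with `return self._retry(request, exception, spider)`:

```python
import random

class RotateOnException:
    """Sketch: rotate the proxy when the download fails with an exception
    (e.g. a timeout), the case process_response() never sees."""

    def __init__(self, proxies):
        self.proxies = proxies

    def process_exception(self, request, exception, spider):
        # Pick a fresh proxy before the request is re-scheduled; `request`
        # here is any object carrying a `meta` dict, as Scrapy requests do.
        request.meta['proxy'] = f"http://{random.choice(self.proxies)}"
        # Headers would be regenerated the same way; in the real middleware
        # this would finish with: return self._retry(request, exception, spider)
        return request
```

The same header-regeneration line from process_response() would go alongside the proxy line.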

Some logs are attached below:

2020-12-30 02:06:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-30 02:06:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
{'download_timeout': 15.0, 'download_slot': 'www.example-website.gr', 'download_latency': 0.12942719459533691, 'proxy': 'http://139.162.60.99:8080'}
{b'Accept': [b'*/*'], b'Connection': [b'keep-alive'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 OPR/52.0.2871.40'], b'Accept-Encoding': [b'gzip, deflate, br'], b'Dnt': [b'1'], b'Referer': [b'https://google.com']}
2020-12-30 02:06:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 1 times): 403 Forbidden
{'download_timeout': 15.0, 'download_slot': 'www.example-website.gr', 'download_latency': 3.5957388877868652, 'proxy': 'http://103.211.10.14:52616', 'retry_times': 1}
{b'Accept': [b'*/*'], b'Connection': [b'keep-alive'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0.3) Gecko/20100101 Firefox/63.0.3'], b'Dnt': [b'1'], b'Referer': [b'https://google.com']}
2020-12-30 02:06:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 2 times): 405 Method Not Allowed
2020-12-30 02:06:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 3 times): User timeout caused connection failure.
2020-12-30 02:06:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 4 times): User timeout caused connection failure: Getting https://www.example-website.gr/sitemap took longer than 15.0 seconds..
2020-12-30 02:07:07 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example-website.gr/sitemap> (failed 5 times): User timeout caused connection failure: Getting https://www.example-website.gr/sitemap took longer than 15.0 seconds..
2020-12-30 02:07:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
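Worth double-checking as well: the custom middleware must replace the stock RetryMiddleware in the downloader-middleware chain, otherwise both can run; 550 is the default priority of Scrapy's built-in RetryMiddleware. A hedged settings sketch (the module path `myproject.middlewares` is illustrative):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in retry middleware...
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # ...and slot the custom one into its place (550)
    'myproject.middlewares.CustomRetryMiddleware': 550,
}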

0 Answers
