I built a spider that collects data from the cdiscount website. However, whenever I crawl more than 320 pages of a category, I get 503 errors and the spider shuts down.
How can I handle this? I have tried changing the user agent and using a proxy pool like this:
def __init__(self, *args, **kwargs):
    super(CdiscountSpider, self).__init__(*args, **kwargs)
    self.proxy_pool = ['49.236.220.238:52840', '181.112.41.50:33381', '50.235.111.161:45126']

(...)

request = scrapy.Request(url, callback=self.parse_dir_contents)  # access the category contents
request.meta["proxy"] = random.choice(self.proxy_pool)
yield request
But it did not work. Any help would be much appreciated :)
Answer 0 (score: 0)
You can add a downloader middleware that keeps retrying any URL that receives a 503 response, each time with a new proxy, until it is scraped successfully.
Create a file called custom_middleware.py:
import random
import logging

class CustomMiddleware(object):
    proxy_pool = ['49.236.220.238:52840', '181.112.41.50:33381', '50.235.111.161:45126']

    def process_request(self, request, spider):
        # Route every outgoing request through a randomly chosen proxy
        request.meta['proxy'] = "http://" + random.choice(self.proxy_pool)

    def process_response(self, request, response, spider):
        if response.status == 503:
            logging.error("%s found for %s so retrying" % (response.status, response.url))
            # Re-schedule the request with a different proxy; dont_filter
            # stops the dupe filter from dropping the repeated URL
            req = request.copy()
            req.dont_filter = True
            req.meta['proxy'] = "http://" + random.choice(self.proxy_pool)
            return req
        else:
            return response
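Note that, as written, this retries a 503 indefinitely, so a permanently blocked URL would loop forever. A minimal sketch of a capped variant of process_response (the proxy_retries meta key and the max_proxy_retries class attribute are illustrative names of mine, not part of the original answer):

# Drop-in replacement for process_response in the class above;
# also add a class attribute: max_proxy_retries = 5  (hypothetical cap)
def process_response(self, request, response, spider):
    if response.status == 503:
        retries = request.meta.get('proxy_retries', 0)
        if retries >= self.max_proxy_retries:
            logging.error("giving up on %s after %d proxy retries" % (response.url, retries))
            return response  # give up and let the spider see the 503
        req = request.copy()
        req.dont_filter = True
        req.meta['proxy_retries'] = retries + 1  # count attempts per request
        req.meta['proxy'] = "http://" + random.choice(self.proxy_pool)
        return req
    return response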
Then enable the middleware in your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
    'YOUR_PROJECT_PATH.custom_middleware.CustomMiddleware': 200,
}
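Since the 503s only start after roughly 320 pages, they may simply be rate limiting, so slowing the crawl down can help as much as rotating proxies. A sketch of throttling options for settings.py (the values are illustrative assumptions, not taken from the question):

DOWNLOAD_DELAY = 1.0            # pause between requests to the same domain
AUTOTHROTTLE_ENABLED = True     # back off automatically when responses slow down
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
RETRY_HTTP_CODES = [408, 500, 502, 503, 504]  # let Scrapy's built-in retry also cover 503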
Answer 1 (score: 0)
@Umair:
Here is the new output I get: it now gets stuck on:
2018-10-19 18:09:38 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 3 times): 503 Service Unavailable
2018-10-19 18:09:38 [root] ERROR: 503 found for https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html so retrying
Without the middleware settings:
2018-10-19 17:33:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 1 times): 503 Service Unavailable
2018-10-19 17:33:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 2 times): 503 Service Unavailable
2018-10-19 17:33:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 3 times): 503 Service Unavailable
2018-10-19 17:33:33 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (referer: https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-320.html)
2018-10-19 17:33:33 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-19 17:33:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 417892,
'downloader/request_count': 945,
'downloader/request_method_count/GET': 945,
'downloader/response_bytes': 47181633,
'downloader/response_count': 945,
'downloader/response_status_count/200': 942,
'downloader/response_status_count/503': 3,
'dupefilter/filtered': 935,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 19, 15, 33, 33, 943375),
'item_scraped_count': 44038,
'log_count/DEBUG': 44986,
'log_count/INFO': 9,
'request_depth_max': 321,
'response_received_count': 943,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/503 Service Unavailable': 2,
'scheduler/dequeued': 945,
'scheduler/dequeued/memory': 945,
'scheduler/enqueued': 945,
'scheduler/enqueued/memory': 945,
'start_time': datetime.datetime(2018, 10, 19, 15, 30, 53, 892275)}
2018-10-19 17:33:33 [scrapy.core.engine] INFO: Spider closed (finished)
With the middleware settings:
2018-10-19 17:16:53 [cdis_bot] ERROR: <twisted.python.failure.Failure builtins.TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType>
2018-10-19 17:16:53 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-19 17:16:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.TypeError': 1,
'downloader/request_bytes': 417452,
'downloader/request_count': 944,
'downloader/request_method_count/GET': 944,
'downloader/response_bytes': 47157342,
'downloader/response_count': 943,
'downloader/response_status_count/200': 943,
'dupefilter/filtered': 936,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 19, 15, 16, 53, 504711),
'httpcache/hit': 943,
'httpcache/miss': 1,
'item_scraped_count': 44131,
'log_count/DEBUG': 45077,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'request_depth_max': 321,
'response_received_count': 943,
'scheduler/dequeued': 944,
'scheduler/dequeued/memory': 944,
'scheduler/enqueued': 944,
'scheduler/enqueued/memory': 944,
'start_time': datetime.datetime(2018, 10, 19, 15, 15, 15, 871700)}
2018-10-19 17:16:53 [scrapy.core.engine] INFO: Spider closed (finished)