503 error when scraping cdiscount (Scrapy, Python)

Asked: 2018-10-19 12:39:54

Tags: python scrapy

I built a spider that collects data from the cdiscount website. However, every time I scrape more than 320 pages of a category, I get a 503 error and the spider closes.

How can I deal with this? I have tried changing the user agent and using a proxy pool like this:

def __init__(self, *args, **kwargs):
    super(CdiscountSpider, self).__init__(*args, **kwargs)
    self.proxy_pool = ['49.236.220.238:52840',  '181.112.41.50:33381', '50.235.111.161:45126']

(...)

       request = scrapy.Request(url, callback=self.parse_dir_contents)  # access the contents of the categories
       request.meta["proxy"] = random.choice(self.proxy_pool)
       yield request

But it didn't work. Any help would be greatly appreciated :)
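
For context, the user-agent change was along these lines (a rough sketch; the middleware class name and the UA strings are illustrative placeholders, not the exact code):

import random

class RandomUserAgentMiddleware(object):

    # Illustrative UA strings; any reasonably current browser strings would do
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)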

2 Answers:

Answer 0 (score: 0)

You can use a downloader middleware that keeps retrying any URL that gets a 503 response with a new proxy until it is scraped successfully.

Create a file named custom_middleware.py:

import random
import logging

class CustomMiddleware(object):

    proxy_pool = ['49.236.220.238:52840', '181.112.41.50:33381', '50.235.111.161:45126']

    def process_request(self, request, spider):
        # Route every outgoing request through a randomly chosen proxy
        request.meta['proxy'] = "http://" + random.choice(self.proxy_pool)

    def process_response(self, request, response, spider):
        # On a 503, re-issue the same request through a different proxy
        if response.status in [503]:
            logging.error("%s found for %s so retrying" % (response.status, response.url))
            req = request.copy()
            req.dont_filter = True  # bypass the dupefilter so the retry is not dropped
            req.meta['proxy'] = "http://" + random.choice(self.proxy_pool)
            return req
        else:
            return response

and enable the middleware in your settings.py:

DOWNLOADER_MIDDLEWARES = {
    # note: the old 'scrapy.contrib.*' path is deprecated since Scrapy 1.0
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
    'YOUR_PROJECT_PATH.custom_middleware.CustomMiddleware': 200,
}
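
One more thing to consider: Scrapy's built-in RetryMiddleware also retries 503 responses (up to RETRY_TIMES) before giving up, so it and the custom middleware can end up fighting over the same responses. A sketch of settings that hand 503 handling entirely to the custom middleware and slow the crawl down (the values shown are illustrative, not tuned for cdiscount):

# settings.py - illustrative values
RETRY_HTTP_CODES = [500, 502, 504, 522, 524, 408]  # the default list minus 503
DOWNLOAD_DELAY = 1             # pause between requests to look less aggressive
AUTOTHROTTLE_ENABLED = True    # back off automatically when the server slows down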

Answer 1 (score: 0)

@Umair:

Here is the new message I get; the spider stays stuck at:

2018-10-19 18:09:38 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 3 times): 503 Service Unavailable
2018-10-19 18:09:38 [root] ERROR: 503 found for https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html so retrying
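
The "Gave up retrying ... so retrying" pair suggests the built-in RetryMiddleware exhausts its attempts and the custom middleware then immediately re-queues the same URL, so a dead page can bounce between them indefinitely. A rough sketch of capping the custom retries, using an illustrative custom_retries meta key (not part of the original answer):

    def process_response(self, request, response, spider):
        # Sketch: stop re-queueing after a few proxy swaps so one dead URL
        # cannot loop forever ('custom_retries' is an illustrative meta key)
        if response.status == 503:
            retries = request.meta.get('custom_retries', 0)
            if retries >= 5:  # arbitrary cap
                return response  # let the spider see the 503 and move on
            req = request.copy()
            req.dont_filter = True
            req.meta['custom_retries'] = retries + 1
            req.meta['proxy'] = "http://" + random.choice(self.proxy_pool)
            return req
        return response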

Without the middleware settings:

2018-10-19 17:33:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 1 times): 503 Service Unavailable
2018-10-19 17:33:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 2 times): 503 Service Unavailable
2018-10-19 17:33:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (failed 3 times): 503 Service Unavailable
2018-10-19 17:33:33 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-321.html> (referer: https://www.cdiscount.com/au-quotidien/hygiene-soin-beaute/shampoings/accessoires-pour-cheveux/l-127020901-320.html)
2018-10-19 17:33:33 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-19 17:33:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 417892,
'downloader/request_count': 945,
'downloader/request_method_count/GET': 945,
'downloader/response_bytes': 47181633,
'downloader/response_count': 945,
'downloader/response_status_count/200': 942,
'downloader/response_status_count/503': 3,
'dupefilter/filtered': 935,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 19, 15, 33, 33, 943375),
'item_scraped_count': 44038,
'log_count/DEBUG': 44986,
'log_count/INFO': 9,
'request_depth_max': 321,
'response_received_count': 943,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/503 Service Unavailable': 2,
'scheduler/dequeued': 945,
'scheduler/dequeued/memory': 945,
'scheduler/enqueued': 945,
'scheduler/enqueued/memory': 945,
'start_time': datetime.datetime(2018, 10, 19, 15, 30, 53, 892275)}
2018-10-19 17:33:33 [scrapy.core.engine] INFO: Spider closed (finished)

With the middleware settings:

2018-10-19 17:16:53 [cdis_bot] ERROR: <twisted.python.failure.Failure builtins.TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType>
2018-10-19 17:16:53 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-19 17:16:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.TypeError': 1,
'downloader/request_bytes': 417452,
'downloader/request_count': 944,
'downloader/request_method_count/GET': 944,
'downloader/response_bytes': 47157342,
'downloader/response_count': 943,
'downloader/response_status_count/200': 943,
'dupefilter/filtered': 936,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 19, 15, 16, 53, 504711),
'httpcache/hit': 943,
'httpcache/miss': 1,
'item_scraped_count': 44131,
'log_count/DEBUG': 45077,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'request_depth_max': 321,
'response_received_count': 943,
'scheduler/dequeued': 944,
'scheduler/dequeued/memory': 944,
'scheduler/enqueued': 944,
'scheduler/enqueued/memory': 944,
'start_time': datetime.datetime(2018, 10, 19, 15, 15, 15, 871700)}
2018-10-19 17:16:53 [scrapy.core.engine] INFO: Spider closed (finished)
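
About the TypeError at the end: one common way to hit "to_bytes must receive a unicode, str or bytes object, got NoneType" is a proxy URL missing its scheme, which makes the host parse as None (an assumption about this particular trace, not a confirmed diagnosis). A defensive helper for the middleware, with _pick_proxy as a hypothetical name:

    def _pick_proxy(self):
        # Hypothetical helper: make sure every proxy entry carries a scheme
        # before it goes into request.meta['proxy'], since a scheme-less
        # proxy can parse with host=None further down the download stack
        proxy = random.choice(self.proxy_pool)
        if not proxy.startswith(('http://', 'https://')):
            proxy = 'http://' + proxy
        return proxy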