Why does this Scrapy proxy middleware produce duplicate requests?

Date: 2017-07-19 06:05:02

Tags: python proxy scrapy scrapy-spider

I want to use a proxy middleware to add a proxy to my spider, but I don't understand why it filters the requests as duplicates.

Here is the code:

# spider.py

from scrapy import Request
from scrapy.spiders import CrawlSpider

from TaylorSpider.items import TaylorspiderItem  # assumed items module path


class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        yield Request(self.start_urls[0], dont_filter=True, callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url      

        yield item

# middleware.py

import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return request


# settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}      

With dont_filter=True, the spider gets stuck in an infinite loop, and the log is:

2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................

However, with dont_filter=False, the log is:

2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)
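
The dupefilter hint in the second log refers to Scrapy's DUPEFILTER_DEBUG setting; enabling it in settings.py makes the dupefilter log every filtered request instead of only the first one:

DUPEFILTER_DEBUG = True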

So how can I fix this?

1 Answer:

Answer 0 (score: 0)

A downloader middleware's process_request() should return None when it only patches the request and wants the framework to keep processing it. From the Scrapy documentation:

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called, the request performed (and its response downloaded).

(...)

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

So you want to drop the return request at the end of process_request, so that the method implicitly returns None. As written, the middleware reschedules every request it sees: with dont_filter=True the same request is re-queued forever, and with dont_filter=False the rescheduled copy is caught by the dupefilter and the crawl ends without downloading anything.
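
A minimal sketch of the corrected middleware (same logger setup as in the question):

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        # No return statement: returning None lets Scrapy continue
        # processing this request through the remaining middlewares
        # and the downloader, instead of rescheduling it.

With this change the request is downloaded through the proxy once, parse_start_url runs, and neither the infinite loop nor the dupefilter message appears.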