I want to add a proxy to my spider with a proxy middleware, but I don't understand why it filters the request as a duplicate.
Here is the code:
# spider.py
from scrapy import Request
from scrapy.spiders import CrawlSpider

from TaylorSpider.items import TaylorspiderItem  # assuming the project's items module


class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        yield Request(self.start_urls[0], dont_filter=True, callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url
        yield item
# middleware.py
import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return request
# setting.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}
When dont_filter=True, the spider gets stuck in an infinite loop, and the log is:
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
However, when dont_filter=False, the log is:
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)
So how can I fix this?
Answer 0 (score: 0)
A downloader middleware's process_request should return None if it only modifies the request and wants the framework to continue processing it:
process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.
If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler performs the request (and its response is downloaded).
(...)
If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
So you want to drop the return request at the end of process_request and let it return None implicitly. Because your middleware returns the request, Scrapy stops the middleware chain and reschedules that request: with dont_filter=True the rescheduled request hits your middleware again and again (the infinite loop above), and with dont_filter=False the rescheduled copy is caught by the duplicate filter and the spider closes immediately.
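For illustration, here is a minimal sketch of the corrected middleware, reusing the proxy address from the question; the only change is removing the return request so that process_request returns None:

# middleware.py (corrected sketch)
import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        logger.info('setting proxy for %s', request.url)
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        # No return statement: returning None lets Scrapy continue
        # processing this request through the remaining middlewares
        # and download it, instead of rescheduling it.

With this change the request is downloaded once, so you no longer need dont_filter=True just to get past the duplicate filter.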