This is the request_seen class method of Scrapy's default dupefilter:
# excerpt from scrapy/dupefilters.py
import os

from scrapy.dupefilters import BaseDupeFilter


class RFPDupeFilter(BaseDupeFilter):

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
When implementing a custom dupefilter, unlike with other Scrapy middlewares, I cannot retrieve the spider object from this class. Is there a way to know which spider object is in use, so that I can customize the dupefilter spider by spider?
Also, I can't just implement a middleware that reads URLs into a list and checks for duplicates there instead of using a custom dupefilter. This is because I need to pause/resume crawls, and I need Scrapy to store the request fingerprints using the JOBDIR setting by default.
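(For reference, the pause/resume workflow I mean is Scrapy's documented JOBDIR persistence: a crawl started with, say,

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

can be stopped and later resumed with the same command, and the default dupefilter then persists its fingerprints to a requests.seen file inside that directory.)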
Answer 0 (score: 3)
If you really want this, a solution is to override the request_seen method signature of RFPDupeFilter so that it takes the spider as an extra argument, (self, request, spider); you then also need to override the Scrapy Scheduler's enqueue_request method, because request_seen is called inside it. You can create the new scheduler and the new dupefilter like this:
# /scheduler.py
from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        # Same logic as the stock Scheduler.enqueue_request,
        # except that the spider is passed on to the dupefilter.
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True
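Note that self.spider is available here because the stock Scheduler stores the spider instance in its open() method when the crawl starts, so the override above only changes the call into the dupefilter.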
# /dupefilters.py
import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        # Do things with spider
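As one illustration of what "Do things with spider" could look like, here is a minimal sketch that lets individual spiders opt out of deduplication via a dupefilter_enabled attribute; that attribute name is invented for this sketch and is not part of Scrapy:

# A hypothetical per-spider tweak; dupefilter_enabled is an
# invented attribute, not a Scrapy feature.
class PerSpiderDupeFilter(MyRFPDupeFilter):

    def request_seen(self, request, spider):
        # Spiders that set dupefilter_enabled = False skip deduplication.
        if not getattr(spider, 'dupefilter_enabled', True):
            return False
        return super().request_seen(request, spider)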
And set their paths in settings.py:
# /settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
SCHEDULER = 'myproject.scheduler.MyScheduler'
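Continuing the sketch above, a spider then only needs to expose whatever the dupefilter inspects; for example, with the hypothetical dupefilter_enabled attribute (and PerSpiderDupeFilter set as DUPEFILTER_CLASS instead):

# /spiders/example.py -- hypothetical spider for the sketch above
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    dupefilter_enabled = False  # read by PerSpiderDupeFilter

    def parse(self, response):
        yield {'url': response.url}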