Scrapy - retrieving the spider object in the dupefilter

Date: 2015-09-10 06:10:38

Tags: python, scrapy

This is the request_seen method of Scrapy's default dupefilter class:

import os

from scrapy.dupefilters import BaseDupeFilter


class RFPDupeFilter(BaseDupeFilter):

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            # Already seen: returning True makes the scheduler drop it
            return True
        self.fingerprints.add(fp)
        if self.file:
            # Persist the fingerprint so the crawl can be resumed later
            self.file.write(fp + os.linesep)

When implementing a custom dupefilter, I can't retrieve the spider object from this class the way I can from other Scrapy middlewares.

Is there any way to know which spider object is making the request, so that I can customize the dupefilter per spider?
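
From what I can see in Scrapy's source, and this is my reading of Scrapy ~1.0 so take it as an assumption, the dupefilter is instantiated through a from_settings classmethod that only ever receives the settings object, never a spider or crawler, which is why there is no spider reference to grab:

# Sketch of RFPDupeFilter.from_settings in scrapy/dupefilters.py (~1.0):
# only `settings` comes in, so no spider is available to the instance.
@classmethod
def from_settings(cls, settings):
    debug = settings.getbool('DUPEFILTER_DEBUG')
    return cls(job_dir(settings), debug)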

Also, I can't just implement a middleware that reads URLs and puts them into a list to check for duplicates instead of using a custom dupefilter. That's because I need to be able to pause/resume crawls, and I need Scrapy to keep storing request fingerprints to disk by default via the JOBDIR setting.
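
For reference, this is the pause/resume setup I rely on; the job directory path below is just an example:

# settings.py (or per run: scrapy crawl myspider -s JOBDIR=crawls/run-1)
# With JOBDIR set, Scrapy persists the scheduler queue and the seen
# request fingerprints (requests.seen), so a stopped crawl can resume.
JOBDIR = 'crawls/run-1'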

1 Answer:

Answer 0 (score: 3):

If you really want it, one solution is to override the signature of RFPDupeFilter's request_seen method so that it receives two arguments, (self, request, spider); you also need to override the Scrapy Scheduler's enqueue_request method, because request_seen is called inside it. You can create the new scheduler and new dupefilter like this:

# /scheduler.py

from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        # Identical to the stock Scheduler.enqueue_request, except that
        # the spider is passed through to the dupefilter
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True


# /dupefilters.py

import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    # Same as the default request_seen, but with an extra spider argument
    def request_seen(self, request, spider):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

        # Do things with spider

And set the paths in settings.py:

# /settings.py

DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
SCHEDULER = 'myproject.scheduler.MyScheduler'
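
As a sketch of what "Do things with spider" could look like, here is a variant that customizes filtering per spider; note that the skip_dupefilter attribute is invented for illustration and is not a Scrapy feature:

# /dupefilters.py - hypothetical per-spider customization

import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        # `skip_dupefilter` is a made-up spider attribute, not a Scrapy
        # setting: spiders that set it to True bypass filtering entirely
        if getattr(spider, 'skip_dupefilter', False):
            return False
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            spider.logger.debug('Filtered duplicate request: %s', request)
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)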