How to reset the standard dupefilter in Scrapy

Date: 2015-06-09 17:08:57

Tags: scrapy

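For some reason, I would like to reset the list of seen URLs that Scrapy maintains internally, at a certain point in my spider code.

I know that by default Scrapy uses the RFPDupeFilter class, which keeps a set of fingerprints.

How can I clear this set from within the spider code?

To be more specific: I would like to clear the set in a custom idle_handler method that is called by the spider_idle signal.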

2 answers:

Answer 0: (score: 1)

You can access the current dupefilter object used by the spider via self.crawler.engine.slot.scheduler.df.

from scrapy import signals, Spider
from scrapy.xlib.pydispatch import dispatcher


class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.reset_dupefilter, signals.spider_idle)

    def reset_dupefilter(self, spider):
        # when the spider goes idle, replace the dupefilter's stored
        # fingerprints with an empty set so seen URLs can be requested again
        self.crawler.engine.slot.scheduler.df.fingerprints = set()

    def parse(self, response):
        pass
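
The reset in the answer above assumes the dupefilter exposes a fingerprints attribute, which may not hold if a custom DUPEFILTER_CLASS is configured. A minimal defensive sketch follows; the helper name reset_seen_requests and the stand-in objects are hypothetical and only mimic the attribute chain quoted in the answer, which is internal to Scrapy and may differ between versions.

```python
from types import SimpleNamespace


def reset_seen_requests(crawler):
    """Empty the dupefilter's fingerprint set; return how many entries were dropped.

    Returns 0 and changes nothing if the dupefilter has no `fingerprints`
    attribute (for example, with a custom DUPEFILTER_CLASS).
    """
    # Attribute path taken from the answer above (internal to Scrapy).
    dupefilter = crawler.engine.slot.scheduler.df
    fingerprints = getattr(dupefilter, "fingerprints", None)
    if fingerprints is None:
        return 0
    dropped = len(fingerprints)
    dupefilter.fingerprints = set()
    return dropped


# Stand-in objects (not real Scrapy classes) that mimic only the attribute chain.
fake_df = SimpleNamespace(fingerprints={"fp1", "fp2"})
fake_crawler = SimpleNamespace(
    engine=SimpleNamespace(slot=SimpleNamespace(scheduler=SimpleNamespace(df=fake_df)))
)
print(reset_seen_requests(fake_crawler))  # prints 2
```

Inside a spider, the same helper could be called as reset_seen_requests(self.crawler) from the idle handler.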

Answer 1: (score: 1)

You can reset the fingerprint set by reinitializing the dupefilter's fingerprints attribute to an empty set from inside the spider.

See https://github.com/scrapy/scrapy/issues/1762 for details.