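For some reason, I want to reset the list of seen URLs that Scrapy maintains internally, at some point in my spider code.
I know that by default Scrapy uses the RFPDupeFilter class and that it keeps a set of fingerprints.
How can I clear this set from within the spider code?
To be more specific: I would like to clear the set in a custom idle_handler method that is called by the spider_idle signal.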
Answer 0 (score: 1)
You can access the current dupefilter object used by the spider via self.crawler.engine.slot.scheduler.df.
from scrapy import signals, Spider


class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # scrapy.xlib.pydispatch is not available in current Scrapy releases,
        # so the signal is connected through the crawler instead
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.reset_dupefilter, signal=signals.spider_idle)
        return spider

    def reset_dupefilter(self, spider):
        # clear the fingerprints stored by the dupefilter when idle
        self.crawler.engine.slot.scheduler.df.fingerprints = set()

    def parse(self, response):
        pass
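To illustrate why replacing the set is enough, here is a minimal stand-in for a fingerprint-based duplicate filter. The class name SketchDupeFilter and the SHA-1-of-the-URL fingerprint are assumptions made for this example only; Scrapy's real RFPDupeFilter fingerprints the whole request, but it likewise records the results in a fingerprints set and checks new requests against it.

```python
import hashlib


class SketchDupeFilter:
    """Hypothetical stand-in for a fingerprint-based dupefilter (not Scrapy's class)."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        # Fingerprint the URL and report whether it was already recorded.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


df = SketchDupeFilter()
first = df.request_seen("http://www.example.com/")   # not seen yet
second = df.request_seen("http://www.example.com/")  # duplicate
df.fingerprints = set()                               # same kind of reset as in the answer
third = df.request_seen("http://www.example.com/")   # accepted again after the reset
print(first, second, third)  # False True False
```

Once the set is empty, the filter has no memory of earlier requests, so URLs that were already crawled can be scheduled again.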
Answer 1 (score: 1)
You can reset the fingerprint set by reinitializing the dupefilter's fingerprints attribute to an empty set from within your spider.
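As a rough sketch of resetting the fingerprint set from inside a spider, the helper below empties the set in place and returns quietly if any part of the attribute chain is missing. The name clear_seen_fingerprints is hypothetical, and the SimpleNamespace objects only imitate the crawler.engine.slot.scheduler.df chain so that the example runs without Scrapy; it assumes the dupefilter exposes a fingerprints set, as RFPDupeFilter does.

```python
from types import SimpleNamespace


def clear_seen_fingerprints(spider):
    """Empty the dupefilter's fingerprint set; return how many entries were dropped."""
    # Walk the attribute chain defensively: any link may be missing or None
    # (assumption: for example, the slot may not exist before the spider opens).
    obj = spider
    for name in ("crawler", "engine", "slot", "scheduler", "df"):
        obj = getattr(obj, name, None)
        if obj is None:
            return 0
    fingerprints = getattr(obj, "fingerprints", None)
    if fingerprints is None:
        return 0
    dropped = len(fingerprints)
    fingerprints.clear()  # in place, so other references see the empty set too
    return dropped


# Fake objects that mimic the attribute chain, for demonstration only.
df = SimpleNamespace(fingerprints={"a1", "b2", "c3"})
scheduler = SimpleNamespace(df=df)
spider = SimpleNamespace(
    crawler=SimpleNamespace(engine=SimpleNamespace(slot=SimpleNamespace(scheduler=scheduler)))
)
dropped = clear_seen_fingerprints(spider)
print(dropped, df.fingerprints)  # 3 set()
```

In a real spider, a helper like this could be called with self from the spider_idle handler. Clearing in place and assigning a new set() both leave the filter empty; the in-place form only matters if other code holds a reference to the same set.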