I am working on a Scrapy project where I wrote a DOWNLOADER MIDDLEWARE to avoid making requests for URLs that are already in the database.
DOWNLOADER_MIDDLEWARES = {
    'imobotS.utilities.RandomUserAgentMiddleware': 400,
    'imobotS.utilities.DupFilterMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
The idea is to connect in __init__ and load a distinct list of all the URLs currently stored in the database, then raise IgnoreRequest if the scraped item is already in the database.
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    def __init__(self):
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')
        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None
So, since I want to restrict the url_list to the WEBSITE_NAME defined at init, is there any way to identify the current spider name inside the downloader middleware's __init__ method?
Answer 0 (score: 2)
Yes, you can access the spider name in the middleware by defining a from_crawler class method and connecting the spider_opened signal to a spider_opened function. You can then save the spider name on the middleware class.
from scrapy import signals

class DuplicateFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.spider_name = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        # connect the middleware object to the spider_opened signal
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # return the middleware object
        return ext

    def spider_opened(self, spider):
        self.spider_name = spider.name
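Since the spider name is only known once spider_opened fires, the per-site URL query from the question can be moved there. Below is a minimal sketch of the combined middleware, assuming the question's my_db/ad collection layout and that spider.name matches the 'site' value stored in the documents (the question uses spider._site_name; substitute that if the two differ):

import pymongo
from scrapy import signals
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.url_set = set()

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        # The spider is known here, so the per-site URL list can be loaded now
        # instead of in __init__ (connection call copied from the question's
        # pymongo version; newer pymongo releases use MongoClient instead).
        connection = pymongo.Connection('localhost', 12345)
        db = connection['my_db']
        db.authenticate('scott', '*****')
        self.url_set = set(db.ad.find({'site': spider.name}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None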
For more information about signals, check the Signals documentation.
Answer 1 (score: 1)
You can move the URL sets into process_request and check there whether you have already fetched that URL before.
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    def __init__(self):
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')
        self.url_sets = {}

    def process_request(self, request, spider):
        # Lazily load the URL list for this spider's site on first request
        if not self.url_sets.get(spider._site_name):
            self.url_sets[spider._site_name] = self.db.ad.find({'site': spider._site_name}).distinct('url')
        print "%s - process Request URL: %s" % (spider._site_name, request.url)
        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None
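One caveat with this approach: distinct('url') returns a plain list, so the request.url in ... membership test is a linear scan over every stored URL on every request. A hedged variation of process_request that caches each site's URLs as a set for O(1) lookups (a drop-in replacement for the method in the class above, under the same assumed schema):

    def process_request(self, request, spider):
        site = spider._site_name
        if site not in self.url_sets:
            # Cache a set rather than a list so "request.url in ..." is an
            # O(1) hash lookup instead of a linear scan per request.
            self.url_sets[site] = set(self.db.ad.find({'site': site}).distinct('url'))
        if request.url in self.url_sets[site]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None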
Answer 2 (score: 0)
Building on @Ahsan Roy's answer above, you do not have to use the signals API (at least as of Scrapy 2.4.0): through the from_crawler method you can access the spider (with its name) as well as all the other spider settings. You can use this method to pass whatever arguments you need into the middleware class's constructor (i.e. __init__):
class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
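Applying the same pattern to the original problem, __init__ can load the per-site URL set directly, because the spider is already attached to the crawler when from_crawler runs. Below is a minimal sketch, again assuming the question's my_db/ad schema and that spider.name matches the stored 'site' value; it also uses the newer pymongo.MongoClient API with credentials passed as keyword arguments, which differs from the pymongo.Connection call in the question:

import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.spider is already set when Scrapy builds the middleware
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        # Connection details mirror the question; adjust host/port/credentials
        client = pymongo.MongoClient('localhost', 12345,
                                     username='scott', password='*****')
        db = client['my_db']
        self.url_set = set(db.ad.find({'site': spider.name}).distinct('url'))
        self.settings = settings

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None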