Scrapy - getting spider variables in the downloader middleware __init__

Date: 2014-09-05 02:40:11

Tags: python scrapy

I'm working on a Scrapy project, and I wrote a downloader middleware to avoid making requests to URLs that are already in the database.

DOWNLOADER_MIDDLEWARES = {
    'imobotS.utilities.RandomUserAgentMiddleware': 400,
    'imobotS.utilities.DupFilterMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

The idea is to connect on __init__ and load a distinct list of all the URLs currently stored in the database, and to raise IgnoreRequest whenever a requested URL is already there.

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        # pymongo.Connection is deprecated; modern pymongo uses MongoClient
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        # distinct list of every URL already stored for this site
        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None
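
For reference, _site_name here is a custom attribute set on the spider, not a built-in Scrapy one; a minimal spider sketch might look like this (the names and URL are illustrative, not from the real project):

import scrapy


class WebsiteSpider(scrapy.Spider):
    name = 'website_name'
    _site_name = 'WEBSITE_NAME'  # custom attribute read by the middleware
    start_urls = ['http://www.example.com/ads']

    def parse(self, response):
        # extract ad URLs / items here
        pass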

So, since I want to restrict the url_set loaded on __init__ to the current WEBSITE_NAME, is there a way to identify the current spider name inside the downloader middleware's __init__ method?

3 Answers:

Answer 0 (score: 2)

Yes, you can access the spider name in your middleware by defining a from_crawler classmethod and connecting the spider_opened signal to a spider_opened handler. You can then store the spider name on the middleware instance.

from scrapy import signals


class DupFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.spider_name = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        # connect the middleware object to the spider_opened signal
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # return the middleware object
        return ext

    def spider_opened(self, spider):
        # called when the spider starts; store its name for later use
        self.spider_name = spider.name

For more information on signals, check the Signals documentation.
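
Putting that together with the middleware from the question, one sketch (reusing the question's pymongo connection details and its custom _site_name attribute) defers the database query to spider_opened, where the spider is available:

import pymongo
from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class DupFilterMiddleware(object):

    def __init__(self):
        self.url_set = set()

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        # the spider (and its name) is available here, so the per-site
        # URL list can be loaded once, when the crawl starts
        connection = pymongo.Connection('localhost', 12345)
        db = connection['my_db']
        db.authenticate('scott', '*****')
        self.url_set = set(db.ad.find({'site': spider._site_name}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None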

Answer 1 (score: 1)

You can move the URL set loading under process_request and build it lazily, per site, on the first request; that way the middleware never needs the spider name at __init__ time.

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        connection = pymongo.Connection('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        # one URL set per site, filled lazily on each spider's first request
        self.url_sets = {}

    def process_request(self, request, spider):
        if spider._site_name not in self.url_sets:
            # a set gives O(1) membership tests; distinct() returns a list
            self.url_sets[spider._site_name] = set(
                self.db.ad.find({'site': spider._site_name}).distinct('url'))

        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None

Answer 2 (score: 0)

Building on @Ahsan Roy's answer above, you don't have to use the signals API (at least as of Scrapy 2.4.0):

Through the from_crawler method you have access to the spider (including its name) as well as all the spider settings. You can use it to pass whatever arguments you need to the constructor (i.e. __init__) of your middleware class:

class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
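
Since from_crawler also hands you the crawler settings, the hard-coded connection details from the question could live in settings.py instead; here is a sketch assuming custom MONGO_HOST and MONGO_PORT keys (these names are not built into Scrapy):

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        # MONGO_HOST / MONGO_PORT are hypothetical project settings
        client = pymongo.MongoClient(settings.get('MONGO_HOST', 'localhost'),
                                     settings.getint('MONGO_PORT', 27017))
        db = client['my_db']
        # restrict the URL set to this spider's site up front
        self.url_set = set(db.ad.find(
            {'site': getattr(spider, '_site_name', spider.name)}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None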