Question

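I've found plenty of information on calling a function when a Scrapy spider quits (e.g. Call a function in Settings from spider Scrapy), but I'm looking for how to call a function, just once, when the spider opens. I can't find this in the Scrapy documentation.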

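I have a project with multiple spiders that scrape event information and post it to different Google Calendars. The event information is updated often, so before a spider runs I need to clear out the existing Google Calendar entries in order to refresh them entirely. I have a working function that does this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.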

I've defined a base spider in __init__.py that looks like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function

class BaseSpider(CrawlSpider):

    def clear_calendar(self, CalId):
        ## working code to clear the calendar
        pass

Now I can call that function from within parse_item, like this:

from myproject import BaseSpider

class ExampleSpider(BaseSpider):

    def parse_item(self, response):

        calendarID = 'MycalendarID'
        self.clear_calendar(calendarID)

        ## other stuff to do

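Of course, this calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined".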

How can I call a function that requires an argument just once from within a Scrapy spider? Or is there a better way to go about this?

Answer 1

There is a better way: use the spider_opened signal.

On recent versions of Scrapy you can handle it inside the spider itself. Override from_crawler and connect a spider_opened method to the signal:

from scrapy import Spider, signals

class MySpider(Spider):
    ...
    calendar_id = 'something'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # run spider.spider_opened when Scrapy sends the spider_opened signal
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        calendar_id = self.calendar_id
        # use my calendar_id

Scrapy: call a function when the spider opens
