Question

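I want to access the variable self.cursor so I can use the active PostgreSQL connection, but I can't figure out how to get at the instance of Scrapy's pipeline class.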

import os

import psycopg2


class ScrapenewsPipeline(object):

    def open_spider(self, spider):
        # Open one connection per crawl, using credentials from the environment.
        self.connection = psycopg2.connect(
            host=os.environ['HOST_NAME'],
            user=os.environ['USERNAME'],
            database=os.environ['DATABASE_NAME'],
            password=os.environ['PASSWORD'])
        self.cursor = self.connection.cursor()
        self.connection.set_session(autocommit=True)

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        print("Some Magic Happens Here")
        return item  # a pipeline must return the item (or raise DropItem)

    def checkUrlExist(self, item):
        print("I want to call this function from my spider to access the "
              "self.cursor variable")

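Please note, I realize I could get to process_item by using yield item, but that function is doing other things. I want to reach the connection via self.cursor in checkUrlExist and be able to call the class instance from my spider at will! Thank you.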

Answer 1

You can access all of the spider's class variables here via spider.variable_name.

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    any_variable = "any_value"

Your pipeline

class MyPipeline(object):
    def process_item(self, item, spider):
        # Anything defined on the spider is reachable through the spider argument.
        spider.any_variable
        return item

I suggest creating the connection in your Spider class, just as I declared any_variable in my example. It can then be accessed in your spider as self.any_variable, and in your pipeline as spider.any_variable.
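
A minimal sketch of keeping the connection on the spider and reading it from the pipeline, under a few loud assumptions: sqlite3 stands in for psycopg2 so the snippet runs without a PostgreSQL server, NewsSpider is a plain stand-in class (in a real project it would subclass scrapy.Spider), and the names seen_urls, url_exists and UrlStorePipeline are hypothetical. The only point is that whatever the spider stores on self is reachable in the pipeline through the spider argument.

```python
import sqlite3


class NewsSpider:
    """Stand-in for a scrapy.Spider subclass (assumption: no Scrapy import here)."""

    name = "news"

    def __init__(self):
        # The spider owns the connection, so spider code and pipelines share it.
        # sqlite3 is only a stand-in for psycopg2.connect(...).
        self.connection = sqlite3.connect(":memory:")
        self.cursor = self.connection.cursor()
        self.cursor.execute("CREATE TABLE seen_urls (url TEXT PRIMARY KEY)")

    def url_exists(self, url):
        # Hypothetical helper, callable from anywhere inside the spider.
        # Note: psycopg2 uses %s placeholders instead of sqlite3's ?.
        self.cursor.execute("SELECT 1 FROM seen_urls WHERE url = ?", (url,))
        return self.cursor.fetchone() is not None


class UrlStorePipeline:
    def process_item(self, item, spider):
        # Same cursor, reached through the spider argument passed to the pipeline.
        spider.cursor.execute(
            "INSERT OR IGNORE INTO seen_urls (url) VALUES (?)", (item["url"],)
        )
        spider.connection.commit()
        return item

    def close_spider(self, spider):
        spider.cursor.close()
        spider.connection.close()
```

With this layout the spider can call self.url_exists(url) before yielding a request, and the pipeline never needs to be reached from the spider at all.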

Answer 2

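I know I'm a bit late to the party here, but in case anyone is looking for the proper answer: you can access any pipeline or middleware instance (or, for example, the downloader) through the crawler object, which governs all the other objects. You can access the crawler in your Spider by setting the .crawler attribute at initialization using the from_crawler class method.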

With a bit of digging in the scrapy shell, you should be able to find the instance of any object being used in the current crawl.

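Spider middlewares: crawler.engine.scraper.spidermw.middlewares
Downloader middlewares: crawler.engine.downloader.middleware.middlewares
Item pipelines: crawler.engine.scraper.itemproc.middlewares (treat this with caution; it is only based on some initial exploration in the scrapy shell)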
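
As a rough sketch of how the item-pipeline path above might be used, the hypothetical helper find_pipeline below walks crawler.engine.scraper.itemproc.middlewares and returns the first instance of a given class. The attribute path comes from the list above and is internal to Scrapy, so it may differ between versions. The helper name, the placeholder body of checkUrlExist, and the SimpleNamespace stand-in crawler are assumptions made so the snippet runs on its own.

```python
from types import SimpleNamespace


def find_pipeline(crawler, pipeline_cls):
    """Return the running instance of pipeline_cls, or None if it is not enabled.

    Relies on the internal attribute path noted above, which is not a
    documented Scrapy API and may change between versions.
    """
    for pipeline in crawler.engine.scraper.itemproc.middlewares:
        if isinstance(pipeline, pipeline_cls):
            return pipeline
    return None


class ScrapenewsPipeline:
    def checkUrlExist(self, item):
        return False  # placeholder body, only for this demonstration


# Stand-in crawler with the same attribute shape, only for demonstration.
pipeline = ScrapenewsPipeline()
fake_crawler = SimpleNamespace(
    engine=SimpleNamespace(
        scraper=SimpleNamespace(itemproc=SimpleNamespace(middlewares=[pipeline]))
    )
)
found = find_pipeline(fake_crawler, ScrapenewsPipeline)
```

Inside a spider callback the call would presumably look like find_pipeline(self.crawler, ScrapenewsPipeline), assuming the spider's crawler attribute has been set.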

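Note that I'm not advocating that people access a database connection object from the spider this way. It's just that any Scrapy object instance can be reached through the crawler object, which is the answer to the OP's question as stated in the title.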

Accessing the instance of a Scrapy pipeline class

2 answers: