I have a Scrapy project that uses custom middleware and a custom pipeline to check and store items in a Postgres database. The middleware looks roughly like this:
class ExistingLinkCheckMiddleware(object):
    def __init__(self):
        ... open connection to database

    def process_request(self, request, spider):
        ... before each request check in the DB
        ... that the page hasn't been scraped before
The pipeline looks similar:
class MachinelearningPipeline(object):
    def __init__(self):
        ... open connection to database

    def process_item(self, item, spider):
        ... save the item to the database
It works fine, but it bothers me that I can't find a clean way to close these database connections when the spider finishes.

Does anyone know how to do this?
Answer 0 (score: 6)
I think the best way is to use Scrapy's spider_closed signal, e.g.:
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class ExistingLinkCheckMiddleware(object):
    def __init__(self):
        # open connection to database
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider, reason):
        # close db connection
        pass

    def process_request(self, request, spider):
        # before each request check in the DB
        # that the page hasn't been scraped before
        pass
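The same cleanup applies to the pipeline, and there you don't even need the dispatcher: Scrapy item pipelines can define open_spider and close_spider methods, which the engine calls when the spider starts and finishes. As a minimal sketch, assuming psycopg2 as the Postgres driver and a hypothetical items table (the connection settings, table, and column names are illustrative, not from the question):

import psycopg2  # assumed driver; any DB-API 2.0 module works the same way

class MachinelearningPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.conn = psycopg2.connect(dbname="scraping", user="scrapy")  # hypothetical settings
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # save the item to the database (hypothetical table/columns)
        self.cursor.execute(
            "INSERT INTO items (url, data) VALUES (%s, %s)",
            (item["url"], item["data"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # called when the spider finishes, so the
        # connection is always closed cleanly
        self.conn.close()

Returning the item from process_item keeps the pipeline chain intact; committing per item keeps rows safe if the crawl dies mid-run, at the cost of more round trips than batching.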
Hope that helps.
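Note: newer Scrapy releases removed scrapy.xlib.pydispatch. On those versions, the equivalent wiring for the middleware goes through a from_crawler classmethod and the crawler's own signal manager; a sketch of that variant:

from scrapy import signals

class ExistingLinkCheckMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds the middleware via from_crawler when it is defined,
        # which exposes the crawler's signal manager for subscriptions
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider, reason):
        # close db connection here
        pass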