I have a BaseSpider on Scrapy 0.20.0. I am trying to count the number of website URLs found and print that count as an INFO message when the spider finishes (closes). The problem is that I cannot print this simple integer variable at the end of the session: any print statement in the parse() or parse_item() functions fires far too early, long before the crawl is done.
I also looked at this question, but it seems somewhat outdated and it is not clear how to use it correctly, i.e. where to put the code (myspider.py, pipelines.py, etc.)?
Right now my spider code looks like this:
class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def parse(self, response):
        ...
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...
This obviously does not work as intended. Any better and simpler ideas?
Answer 0 (score: 1)
The first answer referred to works, without adding anything to pipelines.py. Just add the "answer" to your spider code, like this:
# To use "spider_closed" we also need:
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
        ...

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...

    def spider_closed(self, spider):
        # The signal is sent for every spider; only react to our own.
        if spider is not self:
            return
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)
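The mechanism behind this is plain publish/subscribe dispatch: the spider registers a callback for the spider_closed signal, and the engine fires that signal exactly once, after crawling ends, which is why the count is complete by then. Here is a minimal pure-Python sketch of that pattern that runs without Scrapy; the Dispatcher and CounterSpider classes are illustrative stand-ins, not Scrapy API:

```python
class Dispatcher:
    """Toy signal dispatcher: maps a signal name to a list of callbacks."""
    def __init__(self):
        self._handlers = {}

    def connect(self, handler, signal):
        self._handlers.setdefault(signal, []).append(handler)

    def send(self, signal, **kwargs):
        for handler in self._handlers.get(signal, []):
            handler(**kwargs)


class CounterSpider:
    def __init__(self, dispatcher):
        self.found_websites = 0
        # Register our callback for the "spider_closed" signal,
        # mirroring dispatcher.connect(...) in the answer above.
        dispatcher.connect(self.spider_closed, signal="spider_closed")

    def parse_item(self, item):
        # Count items that carry a website, as in the original parse_item().
        if item.get("website"):
            self.found_websites += 1

    def spider_closed(self, spider):
        if spider is not self:  # ignore close events from other spiders
            return
        print("Found %d websites in this session." % self.found_websites)


dispatcher = Dispatcher()
spider = CounterSpider(dispatcher)
for item in [{"website": "a.com"}, {}, {"website": "b.com"}]:
    spider.parse_item(item)
# The engine would send this once the crawl finishes:
dispatcher.send("spider_closed", spider=spider)  # prints "Found 2 websites in this session."
```

Note that the scrapy.xlib.pydispatch import matches Scrapy 0.20; in later Scrapy releases that module was removed and signal handlers are connected through crawler.signals instead.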