I'm trying to use spider.state as described at http://doc.scrapy.org/en/0.22/topics/jobs.html, but I get the error
MyCrawlSpider has no attribute 'state'
I'm accessing it in the __init__() function of my CrawlSpider-derived class. Could that be the problem?
class MyCrawlSpider(CrawlSpider):
    crawl_start = datetime.utcnow().isoformat()

    def __init__(self, *args, **kwargs):
        super(MyCrawlSpider, self).__init__(*args, **kwargs)
        if self.state.get('crawl_start'):
            crawl_start = self.state.get('crawl_start')
        else:
            self.state["crawl_start"] = crawl_start
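My goal is for the crawl_start attribute to always hold the ISO-format datetime string of the moment my crawler was first started, no matter how many times the job has been resumed since.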
Answer 0 (score: 2)
According to the source code, the state attribute is set on the spider in the spider_opened signal handler of the scrapy.contrib.spiderstate.SpiderState extension:

class SpiderState(object):
    """Store and load spider state during a scraping job"""
    ...

    def spider_closed(self, spider):
        if self.jobdir:
            with open(self.statefn, 'wb') as f:
                pickle.dump(spider.state, f, protocol=2)

    def spider_opened(self, spider):
        if self.jobdir and os.path.exists(self.statefn):
            with open(self.statefn, 'rb') as f:
                spider.state = pickle.load(f)
        else:
            spider.state = {}

The spider_opened signal is sent after the __init__() method has run, so at that point there is no state attribute on the spider instance yet. That is why you are getting the error.
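
To make the ordering concrete, here is a minimal sketch in plain Python that only simulates the lifecycle described above; it does not import Scrapy. FakeSpiderState, BrokenSpider, DeferredSpider, on_opened and open_spider are hypothetical names invented for this illustration, not part of Scrapy's API. The idea it shows is to defer any use of state until after the spider_opened handler has attached it, and to use setdefault so the first crawl_start survives a resume.

```python
from datetime import datetime, timezone


class FakeSpiderState:
    """Hypothetical stand-in for the SpiderState extension (not Scrapy code)."""

    def __init__(self, saved_state=None):
        # saved_state plays the role of the pickled file in the job directory.
        self.saved_state = saved_state

    def spider_opened(self, spider):
        # Mirrors the excerpt: load saved state if present, else start empty.
        spider.state = dict(self.saved_state) if self.saved_state else {}


class BrokenSpider:
    def __init__(self):
        # Runs before spider_opened, so self.state does not exist yet.
        self.crawl_start = self.state.get("crawl_start")


class DeferredSpider:
    def on_opened(self):
        # Hypothetical hook, called only after state has been attached.
        now = datetime.now(timezone.utc).isoformat()
        # setdefault keeps a previously saved value and stores `now`
        # only when no earlier crawl_start exists.
        self.crawl_start = self.state.setdefault("crawl_start", now)


def open_spider(spider, extension):
    """Simulate the order: the spider is already built, then it is opened."""
    extension.spider_opened(spider)
    spider.on_opened()
    return spider


# Touching state in __init__ fails, as in the question.
try:
    BrokenSpider()
    init_failed = False
except AttributeError:
    init_failed = True

# First run starts with empty state; the "resumed" run reuses the saved state.
first = open_spider(DeferredSpider(), FakeSpiderState())
resumed = open_spider(DeferredSpider(), FakeSpiderState(first.state))
```

In a real spider the same logic would have to live in code that runs after spider_opened has fired (for example, a callback), which is an assumption to verify against the Scrapy version you are using.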