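I want to track how many spiders have finished when I run several of them in a loop. I tried using signals, but it looks like my spider cannot find other modules outside its own scope. What I want is to register in another script that a crawl has finished, for example by passing or updating a variable.
Example code (shortened version, to illustrate the problem):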
Controller.py
    isWikipediaDone = False
    for file in Spiders:
        process.crawl(file)
    print(isWikipediaDone)
wikipediaSpider.py
    import scrapy
    from scrapy import signals

    class WikipediaSpider(scrapy.Spider):
        .... initialize ....

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            # The class name must match the definition above (WikipediaSpider).
            spider = super(WikipediaSpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            print("Now we are done updating variable in Controller.py!")
            Controller.isWikipediaDone = True
Answer 0 (score: 4)
You can create a controller class and later import it in your spider:
    # controller.py
    class Controller:
        def mark_as_done(self, spider):
            print("{} is done!".format(spider.name))

    controller = Controller()
Then connect the controller's method to the signal inside your spider:
    # myspider.py
    from mypackage.controller import controller
    ...
    crawler.signals.connect(controller.mark_as_done, signals.spider_closed)
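Since the question asks how many spiders have finished, the controller could also keep a record instead of only printing. The following is a minimal sketch under stated assumptions, not part of the answer above: the `expected` argument, the `done` set, the `all_done` method, and the spider names "wikipedia" and "news" are all hypothetical. `SimpleNamespace` stands in for a real spider so the snippet runs without Scrapy; in a real project `mark_as_done` would be connected to `signals.spider_closed` exactly as shown above.

```python
from types import SimpleNamespace


class Controller:
    """Records which spiders have finished (hypothetical extension)."""

    def __init__(self, expected):
        # Names of the spiders we are waiting for (assumed to be known up front).
        self.expected = set(expected)
        self.done = set()

    def mark_as_done(self, spider):
        # Same single-argument handler shape as in the answer above.
        self.done.add(spider.name)
        print("{} is done! ({}/{})".format(
            spider.name, len(self.done), len(self.expected)))

    def all_done(self):
        # True once every expected spider has reported in.
        return self.expected <= self.done


controller = Controller(expected=["wikipedia", "news"])

# Stand-in for a spider that Scrapy would pass to the signal handler.
controller.mark_as_done(SimpleNamespace(name="wikipedia"))
```

Because every spider imports the same module-level `controller` instance, all of them update one shared object, which is what the module-level variable in the question was trying to achieve.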