Scrapy: update a variable when a spider finishes

Date: 2017-03-29 14:18:54

Tags: python scrapy

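I run several spiders in a loop and want to keep track of how many of them have finished. I tried using signals, but my spider does not seem to be able to reach other modules outside its own scope. What I want is to record in another script that a crawl has finished, for example by passing or updating a variable.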

Sample code (shortened version, to illustrate the problem):

Controller.py

isWikipediaDone = False
for file in Spiders:
    process.crawl(file)

print(isWikipediaDone)

wikipediaSpider.py

class WikipediaSpider(scrapy.Spider):
 .... initialize ....

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(WikipediaSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        print("Now we are done updating variable in Controller.py!")
        Controller.isWikipediaDone = True

1 answer:

Answer 0 (score: 4)

You can create a controller class and later import it in your spider:

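# controller.py
class Controller:
    def mark_as_done(self, spider):
        # Called by Scrapy with the spider that has just closed.
        print("{} is done!".format(spider.name))


# Shared module-level instance that spiders import.
controller = Controller()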

Then connect the controller's method to the signal inside your spider:

# myspider.py
from scrapy import signals
from mypackage.controller import controller
...
crawler.signals.connect(controller.mark_as_done, signals.spider_closed)
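
To get back to the original goal of knowing which spiders have finished, the controller can store state instead of only printing. The following is a minimal sketch, not part of the original answer: the `finished` set and the `is_done` helper are hypothetical names, and the stand-in spider object exists only so the example runs without Scrapy installed.

```python
# controller.py (sketch): records which spiders have finished.
from types import SimpleNamespace


class Controller:
    """Collects the names of spiders whose spider_closed signal has fired."""

    def __init__(self):
        # Hypothetical attribute: names of spiders that have closed so far.
        self.finished = set()

    def mark_as_done(self, spider):
        # Scrapy passes the closing spider as the `spider` argument.
        self.finished.add(spider.name)

    def is_done(self, name):
        # Hypothetical helper: has the spider with this name closed yet?
        return name in self.finished


# Shared module-level instance; every module that imports it sees the same object.
controller = Controller()

if __name__ == "__main__":
    # Stand-in for a real spider; only the `name` attribute is used here.
    fake_spider = SimpleNamespace(name="wikipedia")
    controller.mark_as_done(fake_spider)
    print(controller.is_done("wikipedia"))  # True
```

Assuming `process` in the question is a `CrawlerProcess`, the controlling script would import the same `controller` object and read `controller.finished` after `process.start()` returns, since that call blocks until the crawls are done. Keeping the shared object in its own module, rather than in the script that is executed directly, also avoids ending up with two separate copies of the state: a script run as `__main__` and the same file imported by name are distinct module objects.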