How do I exit a Scrapy Python script after the spider stops crawling or hits an exception?

Time: 2017-04-18 15:01:05

Tags: python scrapy

I am trying to run my Scrapy Python script from a bat file in the Windows Task Scheduler every minute.

However, the Python script somehow never exits, and it blocks all of the Task Scheduler's future tasks from starting.

So, my questions are:

  1. How do I exit my Scrapy script gracefully after the spider has finished its run?

  2. How do I exit the Scrapy script when it hits an exception, especially the ReactorNotRunning error?

Thanks all in advance.

Here is the bat file I use to run the Python script:

    @echo off
    python "C:\Scripts\start.py"
    REM pause waits for a keypress and keeps the window (and the scheduled
    REM task) alive; useful only when running the script by hand to debug.
    pause
    

And here is my Python script:

    from cineplex import utils  # assumed: the helper module the script calls but never imports
    from cineplex.spiders import seatings_spider as st  # the question aliases this as both "seat" and "st"; unified as "st"
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor, defer

    # The original script never defines PARENT_DIR; this value is a placeholder.
    PARENT_DIR = "C:\\Scripts\\output"


    def crawl_all_showtimes():
        # Create a CrawlerRunner instance to manage multiple spiders
        runner = CrawlerRunner()

        # Create/check the output folder for today
        directory_for_today = utils.create_dir_for_today(PARENT_DIR)

        # Get all cinema ids and names first
        cinema_dict = utils.get_all_cinemas()

        # Queue the crawls; the deferred chain below stops the reactor when done
        crawl_showtimes_helper(directory_for_today, cinema_dict, runner)

        # Start crawling for showtimes (blocks until reactor.stop() is called)
        reactor.run()


    # Helps to run multiple ShowTimesSpiders sequentially
    @defer.inlineCallbacks
    def crawl_showtimes_helper(output_dir, cinema_dict, runner):
        # Iterate through all cinemas to get show timings
        for cinema_id, cinema_name in cinema_dict.items():  # .iteritems() is Python 2 only
            yield runner.crawl(st.ShowTimesSpider, cinema_id=cinema_id,
                               cinema_name=cinema_name, output_dir=output_dir)
        reactor.stop()


    if __name__ == "__main__":

        # Turn on Scrapy logging
        configure_logging()

        # Collect all showtimes (the original called an undefined crawl_all_seatings())
        crawl_all_showtimes()
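
For comparison, here is a minimal, self-contained sketch of a pattern that lets the process end on its own: CrawlerProcess manages the reactor internally, and its start() call returns once every scheduled spider has finished. The ExampleSpider, its URL, and the exit codes are illustrative assumptions, not part of the project above:

    import sys

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class ExampleSpider(scrapy.Spider):
        # Throwaway spider used only to keep this sketch runnable
        name = "example"
        start_urls = ["https://example.com"]

        def parse(self, response):
            yield {"title": response.css("title::text").get()}


    if __name__ == "__main__":
        try:
            process = CrawlerProcess()
            process.crawl(ExampleSpider)
            process.start()  # blocks until the crawl finishes, then stops the reactor
        except Exception:
            sys.exit(1)  # non-zero exit code so Task Scheduler never hangs on a failure
        sys.exit(0)

The same try/except-plus-sys.exit() wrapper can be placed around crawl_all_showtimes() if the sequential CrawlerRunner pattern above is still needed.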
    

1 Answer:

Answer (score: -1):

The program's main thread is blocking some of Scrapy's threads. So in your main program use:

    import sys
    sys.exit()
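
Expanding on that answer, a hedged sketch for question 2: reactor.stop() raises twisted.internet.error.ReactorNotRunning when the reactor has already stopped, so catching that error before calling sys.exit() keeps a duplicate shutdown attempt from crashing or stranding the script. The shutdown() helper name is illustrative:

    import sys

    from twisted.internet import reactor
    from twisted.internet.error import ReactorNotRunning


    def shutdown(exit_code=0):
        # Illustrative helper: stop the reactor if it is still running,
        # then exit so Task Scheduler can launch the next run.
        try:
            reactor.stop()
        except ReactorNotRunning:
            pass  # the reactor already stopped; nothing more to do
        sys.exit(exit_code)

Calling shutdown() after reactor.run() returns in crawl_all_showtimes() keeps the exit logic in one place.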