Scrapy: unable to schedule a spider

Time: 2018-12-03 18:12:36

Tags: python scrapy scheduled-tasks

I want to run a spider every few minutes. To do that, I put the following script in my project.

import schedule, os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def job():
    process = CrawlerProcess(get_project_settings())
    process.crawl('amazon_spider')
    process.start()  # error: twisted.internet.error.ReactorNotRestartable
    # process.start(stop_after_crawl=False)  # process gets stuck

while True:
    schedule.run_pending()
    schedule.every().minutes.do(job)

Running it produces the following error:

twisted.internet.error.ReactorNotRestartable, or the process hangs if I use process.start(stop_after_crawl=False) instead.

Following an earlier Stack Overflow post, I also tried this:

from twisted.internet import reactor
from amazon.spiders.amazon_spider import AmazonSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(AmazonSpider)
    deferred.addCallback(reactor.callLater, 10, run_crawl)
    return deferred

run_crawl()
reactor.run()

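Again the process gets stuck in the middle of the parse function. I really don't know what to try next. If you have an idea, please let me know. Thanks in advance. (By the way, this is not a duplicate: the existing posts on the same topic did not solve my problem.)

1 answer:

Answer 0: (score: 0)

I use apscheduler: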

pip install apscheduler

Then:

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from Demo.spiders.baidu import YourSpider

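process = CrawlerProcess(get_project_settings())
# TwistedScheduler runs its jobs on the same Twisted reactor that Scrapy uses
scheduler = TwistedScheduler()
# Queue a new crawl every 10 seconds; the reactor is started once and never restarted
scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
scheduler.start()
# False is stop_after_crawl: keep the reactor alive between crawls
process.start(False)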
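
If keeping everything in one long-lived process is not a requirement, another option is to launch each crawl in its own child process. Every run then gets a fresh Twisted reactor, so ReactorNotRestartable cannot come up. Below is a minimal sketch using only the standard library. It assumes the `scrapy` command is on your PATH and that the spider is named `amazon_spider` as in the question. `run_periodically` and its parameters are hypothetical names chosen for illustration, not part of Scrapy.

```python
import subprocess
import time


def run_periodically(command, interval_seconds, max_runs=None):
    """Run `command` in a fresh child process, pausing between runs.

    Returns the list of exit codes, one per run.
    With max_runs=None the loop never ends (hypothetical helper, not a Scrapy API).
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # A new process per run means a new reactor per run.
        completed = subprocess.run(command)
        exit_codes.append(completed.returncode)
        runs += 1
        # Sleep only if another run will follow.
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return exit_codes


# Assumed usage, started from the project root (where scrapy.cfg lives):
# run_periodically(["scrapy", "crawl", "amazon_spider"], interval_seconds=60)
```

The interval here is measured from the end of one crawl to the start of the next, so crawls never overlap. The trade-off compared with the apscheduler approach is the cost of starting a new Python process for each run.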