How do I persist the crawler's state when it dies abruptly?

Time: 2016-08-30 07:50:16

Tags: python scrapy web-crawler

This question is a follow-up to: Scrapy spider does not store state (persistent state).

I have made the crawler's state persistent by following this guide: http://doc.scrapy.org/en/latest/topics/jobs.html
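
For reference, here is a minimal sketch of how that job persistence is wired up. The spider name 'maxverstappen' and the crawls/maxverstappen-1 directory are assumptions for illustration (only the bot name appears in the log below); the equivalent command line is taken from the Scrapy jobs documentation.

    # Equivalent CLI from the Scrapy jobs docs:
    #   scrapy crawl maxverstappen -s JOBDIR=crawls/maxverstappen-1
    # Re-running the same command later resumes from the state saved in JOBDIR.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set('JOBDIR', 'crawls/maxverstappen-1')  # pending requests + dupefilter state live here

    process = CrawlerProcess(settings)
    process.crawl('maxverstappen')  # assumed spider name registered in the project
    process.start()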

This works fine as long as the crawler is stopped properly, i.e. interrupted once with Ctrl+C.

However, I noticed that the spider does not shut down gracefully when:

  1. Ctrl+C is pressed more than once.
  2. The server hits its capacity limit.
  3. Anything else makes it end abruptly.

When the spider is run again after such a crash, it closes itself right after the first crawled URL.

How can I keep the crawler's state persistent when any of the above happens? Otherwise it ends up crawling the whole pile of URLs all over again.

Log from when the spider is run again:

    2016-08-30 08:14:11 [scrapy] INFO: Scrapy 1.1.2 started (bot: maxverstappen)
    2016-08-30 08:14:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'maxverstappen.spiders', 'SPIDER_MODULES': ['maxverstappen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'maxverstappen'}
    2016-08-30 08:14:11 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.spiderstate.SpiderState']
    2016-08-30 08:14:11 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-08-30 08:14:11 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-08-30 08:14:12 [scrapy] INFO: Enabled item pipelines:
    ['maxverstappen.pipelines.MaxverstappenPipeline']
    2016-08-30 08:14:12 [scrapy] INFO: Spider opened
    2016-08-30 08:14:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-08-30 08:14:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/robots.txt> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/robots.txt> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/> (referer: None)
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.inautonews.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.newsnow.co.uk': <GET http://www.newsnow.co.uk/h/Life+&+Style/Motoring>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.americanmuscle.com': <GET http://www.americanmuscle.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.extremeterrain.com': <GET http://www.extremeterrain.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.autoanything.com': <GET http://www.autoanything.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.bmwcoop.com': <GET http://www.bmwcoop.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.automotorblog.com': <GET http://www.automotorblog.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/inautonews>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/inautonews>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET https://plus.google.com/+Inautonewsplus>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.histats.com': <GET http://www.histats.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.hamiltonf1site.com': <GET http://www.hamiltonf1site.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.joshwellsracing.com': <GET http://www.joshwellsracing.com/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jensonbuttonfan.net': <GET http://www.jensonbuttonfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.fernandoalonsofan.net': <GET http://www.fernandoalonsofan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.markwebberfan.net': <GET http://www.markwebberfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.felipemassafan.net': <GET http://www.felipemassafan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nicorosbergfan.net': <GET http://www.nicorosbergfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nickheidfeldfan.net': <GET http://www.nickheidfeldfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.lewishamiltonblog.net': <GET http://www.lewishamiltonblog.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.timoglockfan.net': <GET http://www.timoglockfan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jarnotrullifan.net': <GET http://www.jarnotrullifan.net/>
    2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.brunosennafan.net': <GET http://www.brunosennafan.net/>
    2016-08-30 08:14:12 [scrapy] INFO: Closing spider (finished)
    2016-08-30 08:14:12 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 896,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 4,
     'downloader/response_bytes': 35353,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 4,
     'dupefilter/filtered': 149,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 724932),
     'log_count/DEBUG': 28,
     'log_count/INFO': 7,
     'offsite/domains': 22,
     'offsite/filtered': 23,
     'request_depth_max': 1,
     'response_received_count': 4,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/disk': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/disk': 2,
     'start_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 13456)}
    2016-08-30 08:14:12 [scrapy] INFO: Spider closed (finished)
    

1 Answer:

Answer 0 (score: 0)

One way to do this is to separate the discovery and consumer logic into two spiders: one discovers the product URLs, and the other consumes those URLs and returns results for each of them. If the consumer dies mid-run for whatever reason, it can resume the crawl easily, because the discovery queue is not affected by the crash. A sketch of this split follows.
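
A minimal sketch of that two-spider split, assuming illustrative spider names, a urls.jl output file and generic selectors (none of these are part of the original answer). The discovery spider is run first, e.g. with `scrapy crawl discovery -o urls.jl`, and the consumer reads its output:

    import json
    import scrapy

    class DiscoverySpider(scrapy.Spider):
        """Collects URLs only; run with `-o urls.jl` to persist them to disk."""
        name = 'discovery'  # assumed name
        start_urls = ['http://www.inautonews.com/', 'http://www.thecheckeredflag.com/']

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                yield {'url': response.urljoin(href)}

    class ConsumerSpider(scrapy.Spider):
        """Scrapes each discovered URL; if it dies, the queue on disk is untouched."""
        name = 'consumer'  # assumed name

        def start_requests(self):
            with open('urls.jl') as f:  # output of the discovery spider
                for line in f:
                    yield scrapy.Request(json.loads(line)['url'],
                                         callback=self.parse_item)

        def parse_item(self, response):
            yield {'url': response.url,
                   'title': response.css('title::text').extract_first()}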

There is already a great tool for doing this kind of thing with Scrapy at scale. It is called Frontera:

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritizes the links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.

It sounds complex, but it is quite straightforward. However, if you are running a small, one-off crawl, you may prefer to approach it manually: run the discovery spider and output its results to JSON, then have your consumer spider parse that JSON in a persistent fashion (i.e. pop values from it as they are processed), as sketched below.
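
A minimal sketch of that manual "pop values from the JSON" idea, assuming the discovery spider was run with `-o pending_urls.json` and that the file contains a list of {"url": ...} objects; the file name and spider name are illustrative assumptions:

    import json
    import scrapy

    PENDING = 'pending_urls.json'  # assumed output file of the discovery spider

    class JsonQueueSpider(scrapy.Spider):
        name = 'json_consumer'  # assumed name

        def start_requests(self):
            with open(PENDING) as f:
                self.pending = [row['url'] for row in json.load(f)]
            for url in list(self.pending):
                yield scrapy.Request(url, callback=self.parse_item,
                                     meta={'queued_url': url})

        def parse_item(self, response):
            yield {'url': response.url,
                   'title': response.css('title::text').extract_first()}
            # Pop the finished URL and rewrite the queue, so a crash only
            # re-crawls what was still pending. Rewriting the whole file per
            # item is fine at small scale.
            queued = response.meta['queued_url']
            if queued in self.pending:
                self.pending.remove(queued)
            with open(PENDING, 'w') as f:
                json.dump([{'url': u} for u in self.pending], f)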