This question is in reference to: Scrapy spider does not store state (persistent state).

I have set up persistent crawler state following http://doc.scrapy.org/en/latest/topics/jobs.html, and it works fine when the crawler is stopped properly with an interrupt or Ctrl + C.

However, I noticed that when the spider is not shut down gracefully, the next time it runs it closes itself right after the first crawled URLs.

How can I achieve a persistent state for the crawler when the above happens? Because otherwise it either dies like this or ends up crawling the whole pile of URLs all over again.
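For reference, the Jobs page linked above persists the state by passing a JOBDIR setting on the command line, e.g.:

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Re-running the same command after a stop is supposed to resume the crawl from where it left off.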
Log from when the spider is run again:
2016-08-30 08:14:11 [scrapy] INFO: Scrapy 1.1.2 started (bot: maxverstappen)
2016-08-30 08:14:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'maxverstappen.spiders', 'SPIDER_MODULES': ['maxverstappen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'maxverstappen'}
2016-08-30 08:14:11 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.spiderstate.SpiderState']
2016-08-30 08:14:11 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-30 08:14:11 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-30 08:14:12 [scrapy] INFO: Enabled item pipelines:
['maxverstappen.pipelines.MaxverstappenPipeline']
2016-08-30 08:14:12 [scrapy] INFO: Spider opened
2016-08-30 08:14:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-30 08:14:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/robots.txt> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/robots.txt> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.inautonews.com/> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Crawled (200) <GET http://www.thecheckeredflag.com/> (referer: None)
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.inautonews.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.newsnow.co.uk': <GET http://www.newsnow.co.uk/h/Life+&+Style/Motoring>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.americanmuscle.com': <GET http://www.americanmuscle.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.extremeterrain.com': <GET http://www.extremeterrain.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.autoanything.com': <GET http://www.autoanything.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.bmwcoop.com': <GET http://www.bmwcoop.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.automotorblog.com': <GET http://www.automotorblog.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/inautonews>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/inautonews>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET https://plus.google.com/+Inautonewsplus>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.histats.com': <GET http://www.histats.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.hamiltonf1site.com': <GET http://www.hamiltonf1site.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.joshwellsracing.com': <GET http://www.joshwellsracing.com/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jensonbuttonfan.net': <GET http://www.jensonbuttonfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.fernandoalonsofan.net': <GET http://www.fernandoalonsofan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.markwebberfan.net': <GET http://www.markwebberfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.felipemassafan.net': <GET http://www.felipemassafan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nicorosbergfan.net': <GET http://www.nicorosbergfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.nickheidfeldfan.net': <GET http://www.nickheidfeldfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.lewishamiltonblog.net': <GET http://www.lewishamiltonblog.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.timoglockfan.net': <GET http://www.timoglockfan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.jarnotrullifan.net': <GET http://www.jarnotrullifan.net/>
2016-08-30 08:14:12 [scrapy] DEBUG: Filtered offsite request to 'www.brunosennafan.net': <GET http://www.brunosennafan.net/>
2016-08-30 08:14:12 [scrapy] INFO: Closing spider (finished)
2016-08-30 08:14:12 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 896,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 35353,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'dupefilter/filtered': 149,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 724932),
'log_count/DEBUG': 28,
'log_count/INFO': 7,
'offsite/domains': 22,
'offsite/filtered': 23,
'request_depth_max': 1,
'response_received_count': 4,
'scheduler/dequeued': 2,
'scheduler/dequeued/disk': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/disk': 2,
'start_time': datetime.datetime(2016, 8, 30, 8, 14, 12, 13456)}
2016-08-30 08:14:12 [scrapy] INFO: Spider closed (finished)
Answer 0 (score: 0)
One way to do this is to separate the discovery and consumer logic by having two spiders: one discovers the product URLs, the other consumes those URLs and returns a result for each of them. If the consumer dies mid-run for some reason, the crawl can be resumed easily, because the discovery queue is not affected by the crash.
There is already a great Scrapy-based tool for doing exactly this. It's called Frontera:

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises the links extracted by the crawler to decide which pages to visit next, and is capable of doing this in a distributed manner.
This sounds complex, but it is actually quite straightforward. However, if you are running a small-scale, one-off crawl, you may just want to approach it manually: run the discovery spider and output its results to JSON, then have your consumer spider read that JSON in a persistent way (i.e. pop values from it), as sketched below.
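Here is a minimal sketch of that manual approach (the spider names, the urls.json queue file, and the selectors are made up for illustration; the discovery spider would be run first with something like scrapy crawl discovery -o urls.json):

    import json

    import scrapy


    class DiscoverySpider(scrapy.Spider):
        # Collects candidate URLs only and emits them as items,
        # e.g. written out via the -o urls.json feed export.
        name = "discovery"
        start_urls = ["http://www.inautonews.com/"]

        def parse(self, response):
            for href in response.css("a::attr(href)").extract():
                yield {"url": response.urljoin(href)}


    class ConsumerSpider(scrapy.Spider):
        # Consumes the discovered URLs. The remaining queue is rewritten to
        # disk before each request is issued, so an interrupted run can be
        # restarted without re-crawling everything that was already done.
        name = "consumer"
        queue_file = "urls.json"

        def start_requests(self):
            with open(self.queue_file) as f:
                urls = [item["url"] for item in json.load(f)]
            while urls:
                url = urls.pop(0)
                with open(self.queue_file, "w") as f:
                    json.dump([{"url": u} for u in urls], f)
                yield scrapy.Request(url, callback=self.parse_item)

        def parse_item(self, response):
            yield {"url": response.url,
                   "title": response.css("title::text").extract_first()}

This keeps the "what is left to crawl" bookkeeping outside of Scrapy's own job state, which is exactly why a crash in the consumer does not cost you the discovery work.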