Problem pausing and resuming a Scrapy spider

Time: 2016-05-11 13:19:01

Tags: scrapy

I'm running a very slow crawl of a medium-sized website in order to respect its guidelines on web scraping. That means I need to be able to pause and resume my spider. So far, I've enabled persistence when launching the spider from the command line:

scrapy crawl ngamedallions -s JOBDIR=pass1 -o items.csv
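For what it's worth, persistence can also be enabled in the spider itself rather than on the command line. A minimal sketch (the start URL and parse body are placeholders, not the asker's code), using Scrapy 1.0's per-spider custom_settings:

import scrapy

class NgamedallionsSpider(scrapy.Spider):
    name = "ngamedallions"
    start_urls = ["http://example.com/"]  # placeholder

    # Scrapy serializes the scheduler queue and the duplicates-filter
    # state into this directory, which is what lets a crawl be paused
    # and resumed; equivalent to passing -s JOBDIR=pass1.
    custom_settings = {"JOBDIR": "pass1"}

    def parse(self, response):
        pass  # placeholder for the real parsing logic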

Last night, that seemed to do the trick. I tested my spider and found that when I shut it down cleanly, I could start it again and the crawl would resume where I left off. Today, though, the spider started over from the beginning. I've checked the contents of the pass1 directory, and my requests.seen file does have some content, although 1600 lines for the roughly 3000 pages I crawled last night seems a bit light.

In any case, does anyone know where I'm going wrong when I try to resume my spider?

Update

I went ahead and manually restarted my spider, continuing yesterday's crawl. This time, when I tried shutting down and resuming the spider with the same command (see above), it worked. The start of my log reflected the spider recognizing that it was resuming the crawl.

2016-05-11 10:59:36 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-11 10:59:36 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 10:59:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'USER_AGENT': 'ngamedallions', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 10}
2016-05-11 10:59:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-11 10:59:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 10:59:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 10:59:36 [scrapy] INFO: Enabled item pipelines: NgamedallionsCsvPipeline, NgamedallionsImagesPipeline
2016-05-11 10:59:36 [scrapy] INFO: Spider opened
2016-05-11 10:59:36 [scrapy] INFO: Resuming crawl (3 requests scheduled)

However, when I tried to resume the spider after a second clean shutdown (pause - resume - pause - resume), it started crawling from the beginning again. The start of the log in that case is below, but the main point is that the spider does not report recognizing the crawl as one being resumed.

2016-05-11 11:19:10 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-11 11:19:10 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 11:19:10 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'USER_AGENT': 'ngamedallions', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 10}
2016-05-11 11:19:11 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-11 11:19:11 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 11:19:11 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 11:19:11 [scrapy] INFO: Enabled item pipelines: NgamedallionsCsvPipeline, NgamedallionsImagesPipeline
2016-05-11 11:19:11 [scrapy] INFO: Spider opened

1 answer:

Answer 0 (score: 1)

Scrapy avoids crawling duplicate URLs by default; the Scrapy documentation on the duplicates filter has more information about it (the original answer linked to it here and here).
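To make that concrete: with JOBDIR set, Scrapy's default dupe filter (RFPDupeFilter) reduces each request to a fingerprint and records it in requests.seen, which is how already-crawled pages get skipped after a restart. A rough in-memory sketch of that behavior (simplified, not Scrapy's actual implementation):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

seen = set()

def is_duplicate(request):
    # RFPDupeFilter does essentially this; with JOBDIR set it also
    # appends each new fingerprint to <JOBDIR>/requests.seen on disk.
    fp = request_fingerprint(request)
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(is_duplicate(Request("http://example.com/")))  # False
print(is_duplicate(Request("http://example.com/")))  # True, same fingerprint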


dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
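As a concrete illustration, here is a minimal hypothetical spider (not the asker's code) that deliberately bypasses the duplicates filter for a single repeated request:

import scrapy

class RevisitSpider(scrapy.Spider):
    name = "revisit"
    start_urls = ["http://example.com/"]  # placeholder

    def parse(self, response):
        # Re-request the same URL; without dont_filter=True the
        # duplicates filter (and any persisted requests.seen state)
        # would silently drop this request.
        yield scrapy.Request(response.url, callback=self.parse_again,
                             dont_filter=True)

    def parse_again(self, response):
        self.logger.info("Fetched %s a second time", response.url)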

Also, take a look at this question.