I wrote a spider using the Scrapy framework to parse a product website. The spider stops abruptly without completing the full parse. I researched this a lot, and most answers suggest that my crawler is being blocked by the website. Is there any mechanism by which I can detect whether my spider was blocked by the site or stopped on its own?
Below are the INFO-level log entries from the spider:
2013-09-23 09:59:07+0000 [scrapy] INFO: Scrapy 0.18.0 started (bot: crawler)
2013-09-23 09:59:08+0000 [spider] INFO: Spider opened
2013-09-23 09:59:08+0000 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-09-23 10:00:08+0000 [spider] INFO: Crawled 10 pages (at 10 pages/min), scraped 7 items (at 7 items/min)
2013-09-23 10:01:08+0000 [spider] INFO: Crawled 22 pages (at 12 pages/min), scraped 19 items (at 12 items/min)
2013-09-23 10:02:08+0000 [spider] INFO: Crawled 31 pages (at 9 pages/min), scraped 28 items (at 9 items/min)
2013-09-23 10:03:08+0000 [spider] INFO: Crawled 40 pages (at 9 pages/min), scraped 37 items (at 9 items/min)
2013-09-23 10:04:08+0000 [spider] INFO: Crawled 49 pages (at 9 pages/min), scraped 46 items (at 9 items/min)
2013-09-23 10:05:08+0000 [spider] INFO: Crawled 59 pages (at 10 pages/min), scraped 56 items (at 10 items/min)
And here is the last part of the DEBUG-level entries in the log file before the spider closed:
2013-09-25 11:33:24+0000 [spider] DEBUG: Crawled (200) <GET http://url.html> (referer: http://site_name)
2013-09-25 11:33:24+0000 [spider] DEBUG: Scraped from <200 http://url.html>
//scraped data in JSON form
2013-09-25 11:33:25+0000 [spider] INFO: Closing spider (finished)
2013-09-25 11:33:25+0000 [spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 36754,
'downloader/request_count': 103,
'downloader/request_method_count/GET': 103,
'downloader/response_bytes': 390792,
'downloader/response_count': 103,
'downloader/response_status_count/200': 102,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 9, 25, 11, 33, 25, 1359),
'item_scraped_count': 99,
'log_count/DEBUG': 310,
'log_count/INFO': 14,
'request_depth_max': 1,
'response_received_count': 102,
'scheduler/dequeued': 100,
'scheduler/dequeued/disk': 100,
'scheduler/enqueued': 100,
'scheduler/enqueued/disk': 100,
'start_time': datetime.datetime(2013, 9, 25, 11, 23, 3, 869392)}
2013-09-25 11:33:25+0000 [spider] INFO: Spider closed (finished)
There are still pages left to parse, but the spider stops anyway.
Answer 0 (score: 0)
So far, what I know about a spider:
- There is a queue or pool of URLs to be parsed by parse methods. You can specify a binding from a URL to a particular method, or let the default 'parse' do the work.
- From a parse method, you must return/yield another Request to feed that pool, or yield items.
- When the pool runs out of URLs, or a stop signal is sent, the spider stops crawling.
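The mechanics above can be sketched without Scrapy itself. This is a toy model, not Scrapy's real internals: a pool of URLs, a hypothetical parse callback that yields either new "requests" or items, and a loop that stops as soon as the pool runs dry.

```python
from collections import deque

def parse(url):
    """Hypothetical callback: every page yields one item, and the start
    page also yields two follow-up links. URLs are made up for the sketch."""
    yield {"item": "data from " + url}
    if url == "http://site_name/start":
        yield {"request": "http://site_name/page1"}
        yield {"request": "http://site_name/page2"}

def crawl(start_urls, callback):
    queue = deque(start_urls)          # the URL pool / scheduler
    seen, items = set(start_urls), []
    while queue:                       # the spider stops when the pool is empty
        url = queue.popleft()
        for result in callback(url):
            if "request" in result:    # a yielded Request refills the pool
                if result["request"] not in seen:
                    seen.add(result["request"])
                    queue.append(result["request"])
            else:                      # a yielded item is collected
                items.append(result)
    return items                       # "finished": nothing left to fetch

items = crawl(["http://site_name/start"], parse)
print(len(items))  # 3 pages crawled -> 3 items
```

The point of the sketch: if a callback never yields follow-up requests, the loop ends as soon as the initial URLs are consumed, and that shutdown looks exactly like a normal "finished" close.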
It would help if you shared your spider's code, so we could check whether those bindings are correct. It is easy to miss some bindings by mistake, e.g. when using SgmlLinkExtractor.
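One concrete way to answer the original question ("was I blocked, or did the spider stop on its own?") is to inspect the stats Scrapy dumps at shutdown. Below is a hypothetical helper (the function name and heuristics are assumptions, not a Scrapy API) that classifies a stats dict shaped like the one in the log above:

```python
def diagnose_shutdown(stats):
    """Hypothetical heuristic: guess from Scrapy's final stats dict whether
    the spider was likely blocked or simply ran out of requests."""
    # Status codes that commonly indicate blocking or rate limiting.
    blocked = sum(
        count for key, count in stats.items()
        if key.startswith("downloader/response_status_count/")
        and key.rsplit("/", 1)[1] in ("403", "429", "503")
    )
    if blocked:
        return "possibly blocked (%d suspicious responses)" % blocked
    if stats.get("finish_reason") == "finished":
        # 'finished' means the scheduler ran dry: every enqueued request
        # was dequeued and no callback yielded anything new to crawl.
        return "stopped on its own: no more requests in the queue"
    return "closed for another reason: %s" % stats.get("finish_reason")

# Stats from the log above (abridged):
stats = {
    "downloader/response_status_count/200": 102,
    "downloader/response_status_count/302": 1,
    "finish_reason": "finished",
    "request_depth_max": 1,
}
print(diagnose_shutdown(stats))
```

In this case the dumped stats point away from blocking: almost every response was a 200, the finish_reason is 'finished', and request_depth_max is 1, which suggests the parse callbacks never yielded follow-up Requests, so the queue simply emptied.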