Why can't I resume a crawl in Scrapy?

Asked: 2013-07-17 07:24:07

Tags: scrapy

I tried to resume a crawl I had started earlier using the following command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
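
(As I understand the docs, the same command both starts and resumes a job: the first run has to be stopped gracefully, e.g. with a single Ctrl-C, so the scheduler state is saved under crawls/somespider-1, and then the identical command picks it up again. That is what I did:)

    # first run -- stopped gracefully with a single Ctrl-C
    scrapy crawl somespider -s JOBDIR=crawls/somespider-1
    # later: resume with the exact same command and JOBDIR
    scrapy crawl somespider -s JOBDIR=crawls/somespider-1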

But instead of resuming, it starts from scratch and prints the following log output:

2013-07-17 12:36:57+0530 [scrapy] INFO: Scrapy 0.16.5 started (bot: thesentientspider)
2013-07-17 12:36:58+0530 [scrapy] DEBUG: Enabled extensions: AutoThrottle, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-17 12:36:59+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleWare, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-17 12:36:59+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-17 12:36:59+0530 [scrapy] DEBUG: Enabled item pipelines: MongoDBPipeline
2013-07-17 12:36:59+0530 [zomatoSpider] INFO: Spider opened
2013-07-17 12:36:59+0530 [zomatoSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-17 12:36:59+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6033
2013-07-17 12:36:59+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6090
2013-07-17 12:36:59+0530 [zomatoSpider] DEBUG: Redirecting (301) to <GET http://www.zomato.com/hyderabad/restaurants> from <GET http://www.zomato.com/hyderabad/restaurants/>
2013-07-17 12:37:00+0530 [zomatoSpider] DEBUG: Crawled (200) <GET http://www.zomato.com/hyderabad/restaurants> (referer: None)
2013-07-17 12:37:00+0530 [zomatoSpider] DEBUG: slot: www.zomato.com | conc: 1 | delay: 1000 ms | latency:  283 ms | size:158792 bytes
2013-07-17 12:37:00+0530 [scrapy] DEBUG: Next page URL: http://www.zomato.com/hyderabad/restaurants?page=2
2013-07-17 12:37:00+0530 [zomatoSpider] INFO: Closing spider (finished)
2013-07-17 12:37:00+0530 [zomatoSpider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 619,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 23308,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/301': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 17, 7, 7, 0, 496989),
     'log_count/DEBUG': 10,
     'log_count/INFO': 4,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/disk': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/disk': 2,
     'start_time': datetime.datetime(2013, 7, 17, 7, 6, 59, 463810)}
2013-07-17 12:37:00+0530 [zomatoSpider] INFO: Spider closed (finished)

Here is my spider code (the requests should be serializable, if I'm not mistaken). Settings: http://pastebin.com/CUsf4sTJ Spider: http://pastebin.com/at98Qhjh
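
(To show what I mean by serializable, here is a hypothetical fragment, not my actual code: as far as I know, the JOBDIR disk queue can only persist requests whose callback is a named method on the spider, so I avoid lambdas.)

    from scrapy.http import Request

    def parse(self, response):
        # Persistable: the callback is a named spider method, so the
        # request can be pickled into the JOBDIR disk queue.
        yield Request('http://www.zomato.com/hyderabad/restaurants?page=2',
                      callback=self.parse_page)

        # NOT persistable: a lambda cannot be pickled, so a request
        # like the following would not survive a pause/resume cycle.
        # yield Request(url, callback=lambda r: None)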

What am I doing wrong? Is there any way I can salvage this crawl?

1 Answer:

Answer 0 (score: 1)

You are inheriting from BaseSpider, which only scrapes the start_urls. You should inherit from CrawlSpider (scrapy.contrib.spiders.CrawlSpider) instead.
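
A minimal sketch of what that could look like on Scrapy 0.16 (the rule pattern and callback name are assumptions for illustration, not taken from your pastebin):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ZomatoSpider(CrawlSpider):
        name = 'somespider'
        allowed_domains = ['www.zomato.com']
        start_urls = ['http://www.zomato.com/hyderabad/restaurants']

        # Follow pagination links and hand each listing page to
        # parse_page. The allow pattern is an assumption here.
        rules = (
            Rule(SgmlLinkExtractor(allow=(r'page=\d+',)),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Extract restaurant items here (selectors omitted).
            pass

Note that with CrawlSpider you must not override the built-in parse method, since it drives the rules. The string callback ('parse_page') is also trivially serializable, which is exactly what the JOBDIR disk queue needs to resume.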