CrawlSpider only crawls the start_urls

Asked: 2015-12-05 15:10:31

Tags: python web-crawler scrapy scrapy-spider

Hi all, I'm trying to build a spider that crawls the sherdog website and gives me the details of every fighter (name, date of birth, height, nationality). When I run the spider, it parses the start_urls, but the CrawlSpider never starts crawling beyond them, so I end up with only 2 parsed items. I have read the documentation, but I'm also new to Scrapy, so I may be missing something. Do you have any ideas? The site uses relative URLs, so my first thought was that this might be the problem, but it still didn't work after building absolute URLs. I really hope you can help me!
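
The question does not include the spider code, but a CrawlSpider that produces exactly this symptom typically looks like the sketch below. All names, URL patterns and selectors here are assumptions reconstructed to match the log, not taken from the question; the bug to notice is that the rule's callback is the overridden parse method itself:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class FighterSpider(CrawlSpider):
        # Hypothetical names and patterns, chosen to match the log below
        name = 'fighters'
        allowed_domains = ['sherdog.com']
        start_urls = [
            'http://www.sherdog.com/fighter/Daniel-Cormier-52311',
            'http://www.sherdog.com/fighter/Ronda-Rousey-73073',
        ]

        rules = (
            Rule(LinkExtractor(allow=r'/fighter/'), callback='parse',
                 follow=True),
        )

        # BUG: defining parse() replaces the method CrawlSpider uses
        # internally to apply its rules, so no links are ever followed
        # and only the two start_urls responses are parsed.
        def parse(self, response):
            yield {'name': response.css('span.fn::text').extract_first()}

The log: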

2015-12-07 18:15:11 [scrapy] INFO: Scrapy 1.0.3 started (bot: ufcfights)
2015-12-07 18:15:11 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-12-07 18:15:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ufcfights.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ufcfights.spiders'], 'BOT_NAME': 'ufcfights', 'USER_AGENT': 'Chrome/46.0.2490.80', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 1.0}
2015-12-07 18:15:12 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-07 18:15:12 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-07 18:15:12 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-07 18:15:12 [scrapy] INFO: Enabled item pipelines:
2015-12-07 18:15:12 [scrapy] INFO: Spider opened
2015-12-07 18:15:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-07 18:15:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-07 18:15:12 [scrapy] DEBUG: Crawled (200) <GET http://www.sherdog.com/fighter/Daniel-Cormier-52311> (referer: None)
2015-12-07 18:15:13 [scrapy] DEBUG: Crawled (200) <GET http://www.sherdog.com/fighter/Ronda-Rousey-73073> (referer: None)
2015-12-07 18:15:14 [scrapy] INFO: Closing spider (finished)
2015-12-07 18:15:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 452,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 46874,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 7, 17, 15, 14, 92000),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 12, 7, 17, 15, 12, 618000)}
2015-12-07 18:15:14 [scrapy] INFO: Spider closed (finished)


1 Answer:

Answer 0 (score: 1)

When using CrawlSpider you cannot override the parse method, because CrawlSpider uses parse internally to apply its rules; check the warning here. Your log confirms this: only the two start_urls requests were ever scheduled ('scheduler/enqueued': 2), so the rules never fired.

Just change the callback method of your rule to any name other than parse.
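
A minimal sketch of the fix, reusing the hypothetical names and selectors from the sketch above (none of which come from the question): rename the callback, for example to parse_fighter, and leave parse untouched so CrawlSpider can dispatch the rules.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class FighterSpider(CrawlSpider):
        name = 'fighters'
        allowed_domains = ['sherdog.com']
        start_urls = ['http://www.sherdog.com/fighter/Daniel-Cormier-52311']

        rules = (
            # The callback must NOT be named 'parse'; CrawlSpider needs
            # its own parse() to extract and follow links per the rules.
            Rule(LinkExtractor(allow=r'/fighter/'), callback='parse_fighter',
                 follow=True),
        )

        def parse_fighter(self, response):
            # Selectors are illustrative, not verified against sherdog.com
            yield {
                'name': response.css('span.fn::text').extract_first(),
                'birthday': response.css(
                    'span[itemprop="birthDate"]::text').extract_first(),
            }

Note that with this setup the start_urls responses themselves are used only for link extraction; if you also want to scrape those pages, override parse_start_url, the hook CrawlSpider provides for exactly this, rather than parse.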