I have launched the Scrapy shell and successfully fetched the Wikipedia main page with:
scrapy shell http://en.wikipedia.org/wiki/Main_Page
I believe this step worked, judging by the verbose nature of Scrapy's response.
Next, I want to see what happens when I write:
hxs.select('/html').extract()
At this point I receive the error:
NameError: name 'hxs' is not defined
What's the problem? I know Scrapy installed fine and accepted the destination URL, so why is the hxs command having issues?
Answer 0 (score: 7)
I suspect you're using a version of Scrapy whose shell no longer has hxs.
Use sel instead (itself deprecated after 0.24, see below):
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
Or, starting with Scrapy 1.0, you should use the response object's Selector via the .xpath and .css convenience methods:
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> response.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
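Outside the shell, the same response.xpath() shortcut is available inside a spider callback. Here is a minimal sketch assuming Scrapy 1.0+; the spider name, start URL, and the extracted field name are just illustrative:
import scrapy

class WikiTitleSpider(scrapy.Spider):
    # Hypothetical example spider; name and start_urls are illustrative only.
    name = "wiki_title"
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        # response.xpath() is the same convenience shortcut used in the shell.
        yield {"title": response.xpath('//title/text()').extract_first()}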
FYI, here is a quote from Using selectors in the Scrapy documentation:
... once the shell loads, you will have the response available in the response shell variable, and its attached selector in the response.selector attribute.
...
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
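If it helps to see how the shortcut relates to response.selector, here is a quick check in the shell (assuming the same Wikipedia page; the output shown is what I'd expect):
>>> # response.xpath() is just shorthand for response.selector.xpath()
>>> response.selector.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'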
Answer 1 (score: 0)
You should pay attention to the verbose nature of Scrapy's response.
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
If your verbose output looks like this:
2014-09-20 23:02:14-0400 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2014-09-20 23:02:14-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled item pipelines:
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-09-20 23:02:15-0400 [default] INFO: Spider opened
2014-09-20 23:02:15-0400 [default] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s] item {}
[s] request <GET http://en.wikipedia.org/wiki/Main_Page>
[s] response <200 http://en.wikipedia.org/wiki/Main_Page>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.6 (default, Mar 22 2014, 22:59:38)
Type "copyright", "credits" or "license" for more information.
Your verbose output lists the Available Scrapy objects, so whether you use hxs or sel depends on what appears there. In your case hxs is not available, so you need to use sel (newer Scrapy versions). In short, for some people hxs works, while others need to use sel.
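For comparison, a sketch of what the same query looks like on an old shell that exposes hxs versus a newer one that exposes sel (assuming the same Wikipedia page):
>>> # Old Scrapy shell (e.g. 0.14.x), where hxs is an HtmlXPathSelector:
>>> hxs.select('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
>>> # Newer shells expose sel instead (and, from 1.0, just use response):
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'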
Answer 2 (score: 0)