NameError: name 'hxs' is not defined when using Scrapy

Date: 2014-09-21 02:42:56

Tags: web-scraping scrapy

I have launched the Scrapy shell and successfully fetched Wikipedia:

scrapy shell http://en.wikipedia.org/wiki/Main_Page

I believe this step worked, judging by the verbose nature of Scrapy's response.

Next, I wanted to see what happens when I type:

hxs.select('/html').extract()

At this point, I get the error:

NameError: name 'hxs' is not defined

What is going wrong? I know Scrapy is installed correctly and accepted the destination URL, so why does the hxs command fail?

3 answers:

Answer 0 (score: 7):

I suspect you are using a Scrapy version whose shell no longer provides hxs.

Use sel instead (itself deprecated after 0.24, see below):

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'

Alternatively, from Scrapy 1.0 onwards, you should use the Selector attached to the response object, via the .xpath() and .css() convenience methods:

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> response.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'

For reference, see Using selectors in the Scrapy documentation:


...after the shell loads, you will have the fetched response available in the response shell variable, with its attached selector in the response.selector attribute.
...
Since querying responses using XPath and CSS is so common, responses include two convenience shortcuts: response.xpath() and response.css():


>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

Answer 1 (score: 0):

You should look at the verbose nature of Scrapy's response:

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page

If your verbose output looks like this:

2014-09-20 23:02:14-0400 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2014-09-20 23:02:14-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled item pipelines: 
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-09-20 23:02:15-0400 [default] INFO: Spider opened
2014-09-20 23:02:15-0400 [default] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s]   item       {}
[s]   request    <GET http://en.wikipedia.org/wiki/Main_Page>
[s]   response   <200 http://en.wikipedia.org/wiki/Main_Page>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.6 (default, Mar 22 2014, 22:59:38) 
Type "copyright", "credits" or "license" for more information.

then your verbose output shows the Available Scrapy objects.

So whether you have hxs or sel depends on what that list of objects shows. In your case hxs is not available, so you need to use sel (the replacement in newer versions). In short, some installations provide hxs and others provide sel, and you have to use whichever one your shell defines.
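
If instead you are on an old release like the 0.14.4 shown in the log above, the hxs shortcut is what the shell defines. As a sketch (assuming the legacy HtmlXPathSelector API, which was removed in later Scrapy versions), you can also construct that selector yourself:

>>> # legacy pre-1.0 API sketch: HtmlXPathSelector no longer exists in current Scrapy
>>> from scrapy.selector import HtmlXPathSelector
>>> hxs = HtmlXPathSelector(response)   # wrap the already-fetched response
>>> hxs.select('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'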

Answer 2 (score: 0):

The sel shortcut is deprecated; you should use response.xpath('/html').extract() instead.
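
Outside the shell, the same call is what you would write in a spider's parse() callback. Here is a minimal Scrapy 1.0+ sketch; the spider name and start URL are illustrative, not taken from the question:

import scrapy

class WikiTitleSpider(scrapy.Spider):
    # Illustrative name and URL -- adjust to your own project.
    name = "wiki_title"
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        # response.xpath() replaces the old hxs.select() / sel.xpath() calls.
        yield {"title": response.xpath('//title/text()').extract_first()}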