I have launched the Scrapy shell and successfully fetched the Wikipedia main page with:
scrapy shell http://en.wikipedia.org/wiki/Main_Page
I believe this step worked, judging by the verbose nature of Scrapy's response.
Next, I want to see what happens when I write:
hxs.select('/html').extract()
At this point I receive the error:
NameError: name 'hxs' is not defined
What's the problem? I know Scrapy installed fine and accepted the destination URL, so why is the hxs command having issues?
Answer 0 (score: 7)
I suspect you're using a version of Scrapy whose shell no longer has hxs.
Use sel instead (itself deprecated after 0.24, see below):
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
Or, starting with Scrapy 1.0, you should use the response object's Selector via the .xpath and .css convenience methods:
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> response.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
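Outside the shell, the same response.xpath() shortcut is available inside a spider callback. Here is a minimal sketch assuming Scrapy 1.0+; the spider name, start URL, and the extracted field name are just illustrative:
import scrapy

class WikiTitleSpider(scrapy.Spider):
    # Hypothetical example spider; name and start_urls are illustrative only.
    name = "wiki_title"
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        # response.xpath() is the same convenience shortcut used in the shell.
        yield {"title": response.xpath('//title/text()').extract_first()}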
FYI, here is a quote from Using selectors in the Scrapy documentation:
... once the shell loads, you will have the response available in the response shell variable, and its attached selector in the response.selector attribute.
...
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
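If it helps to see how the shortcut relates to response.selector, here is a quick check in the shell (assuming the same Wikipedia page; the output shown is what I'd expect):
>>> # response.xpath() is just shorthand for response.selector.xpath()
>>> response.selector.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'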
Answer 1 (score: 0)
You should pay attention to the verbose nature of Scrapy's response.
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
If your verbose output looks like this:
2014-09-20 23:02:14-0400 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2014-09-20 23:02:14-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled item pipelines:
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-09-20 23:02:15-0400 [default] INFO: Spider opened
2014-09-20 23:02:15-0400 [default] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s] item {}
[s] request <GET http://en.wikipedia.org/wiki/Main_Page>
[s] response <200 http://en.wikipedia.org/wiki/Main_Page>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.6 (default, Mar 22 2014, 22:59:38)
Type "copyright", "credits" or "license" for more information.
Your verbose output lists the Available Scrapy objects, so whether you use hxs or sel depends on what appears there. In your case hxs is not available, so you need to use sel (newer Scrapy versions). In short, for some people hxs works, while others need to use sel.
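For comparison, a sketch of what the same query looks like on an old shell that exposes hxs versus a newer one that exposes sel (assuming the same Wikipedia page):
>>> # Old Scrapy shell (e.g. 0.14.x), where hxs is an HtmlXPathSelector:
>>> hxs.select('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
>>> # Newer shells expose sel instead (and, from 1.0, just use response):
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'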
Answer 2 (score: 0)