I am using Scrapy to scrape information from a website. My XPath works, but it does not get all of the information from the block.
Python code:
sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0]
I'm using this to grab the first blockquote on the page. It cuts off after the <br>.
For example:
If I have this:
<blockquote class="postcontent restore ">
4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
<br>
Operating System
<br>
Windows 8.1 64
<br>
Display
</blockquote>
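it only returns:
4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
But I want it to return everything, including the HTML tags and the other text inside the blockquote.
Answer 0 (score: 1)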
//div[@class="content"]/div/blockquote/node()
will give you all the nodes under a blockquote: both child text nodes and element nodes.
In your case, you get text nodes and <br> elements.
sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0]
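will extract only the first node, i.e. the text node containing "4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)".
Here is a sample ipython session showing the different outputs you get from different selectors: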
$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
Type "copyright", "credits" or "license" for more information.
IPython 1.2.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import scrapy
In [2]: selector = scrapy.selector.Selector(text="""<blockquote class="postcontent restore ">
...: 4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
...: <br>
...: Operating System
...: <br>
...: Windows 8.1 64
...: <br>
...: Display
...: </blockquote>""")
In [3]: selector.xpath('blockquote/node()').extract()
Out[3]: []
In [4]: selector.xpath('.//blockquote/node()').extract()
Out[4]:
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
u'<br>',
u'\nOperating System\n',
u'<br>',
u'\nWindows 8.1 64\n',
u'<br>',
u'\nDisplay\n']
In [5]: selector.xpath('.//blockquote').extract()
Out[5]: [u'<blockquote class="postcontent restore ">\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n<br>\nOperating System\n<br>\nWindows 8.1 64\n<br>\nDisplay\n</blockquote>']
In [6]: selector.xpath('string(.//blockquote)').extract()
Out[6]: [u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\nOperating System\n\nWindows 8.1 64\n\nDisplay\n']
In [7]: selector.xpath('.//blockquote//text()').extract()
Out[7]:
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
u'\nOperating System\n',
u'\nWindows 8.1 64\n',
u'\nDisplay\n']
In [8]: "\n".join(selector.xpath('.//blockquote//text()').extract())
Out[8]: u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\n\nOperating System\n\n\nWindows 8.1 64\n\n\nDisplay\n'
In [9]:
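If the goal is the blockquote's content with the <br> tags kept, one option is to concatenate every node returned by the node() query instead of taking index 0. The sketch below is not part of the session above: inner_html is a hypothetical helper name, and the input list is copied from Out[4] so the example runs without Scrapy installed.

```python
def inner_html(extracted_nodes):
    """Concatenate the strings that .extract() returns for a node() query.

    Hypothetical helper: it only joins strings, so it relies on the caller
    having selected the children of a single element.
    """
    return "".join(extracted_nodes)


# Same list as Out[4] in the session above (text nodes and <br> elements).
nodes = [
    "\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n",
    "<br>",
    "\nOperating System\n",
    "<br>",
    "\nWindows 8.1 64\n",
    "<br>",
    "\nDisplay\n",
]

html = inner_html(nodes)
print(html)
```

In the session this would be called as inner_html(selector.xpath('.//blockquote/node()').extract()). To keep the enclosing <blockquote> tag as well, the In [5] form already returns it as a single string.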
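Following the OP's comments, the expression that fits is (//div[@class="content"]/div/blockquote)[1]//text(). The parentheses make [1] apply to the whole result set, so it selects the first matching blockquote in the document.
Using the OP's original input page: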
$ scrapy shell http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/
2014-07-16 20:43:45+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-07-16 20:43:45+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-07-16 20:43:45+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-07-16 20:43:45+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled item pipelines:
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-07-16 20:43:46+0200 [default] INFO: Spider opened
2014-07-16 20:43:47+0200 [default] DEBUG: Crawled (200) <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f63775b0c10>
[s] item {}
[s] request <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s] response <200 http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s] settings <scrapy.settings.Settings object at 0x7f6377c4fd90>
[s] spider <Spider 'default' at 0x7f6376d52bd0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: response.xpath('//div[@class="content"]/div/blockquote')
Out[1]:
[<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>]
In [2]: response.xpath('(//div[@class="content"]/div/blockquote)[1]')
Out[2]: [<Selector xpath='(//div[@class="content"]/div/blockquote)[1]' data=u'<blockquote class="postcontent restore "'>]
In [3]: response.xpath('(//div[@class="content"]/div/blockquote)[1]//text()')
Out[3]:
[<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t\tGot a coupon that stated 50% off a'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nCode is CAG5014'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nDeal is on! '>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u"Don't Forget to tip driver!!">,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t'>]
In [4]: response.xpath('string((//div[@class="content"]/div/blockquote)[1])').extract()
Out[4]: [u"\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. \r\n\r\nCode is CAG5014\r\n\r\nDeal is on! \r\n\r\nDon't Forget to tip driver!!\r\n\r\n\r\n\t\t\t"]
In [5]: response.xpath('normalize-space((//div[@class="content"]/div/blockquote)[1])').extract()
Out[5]: [u"Got a coupon that stated 50% off any pizza at menu price. Code is CAG5014 Deal is on! Don't Forget to tip driver!!"]
In [6]:
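normalize-space() collapses everything onto one line. If the line breaks between the <br>-separated pieces matter, a possible alternative is to clean the //text() results in Python. This is only a sketch: clean_text_nodes is a hypothetical helper, and the input list is reconstructed from Out[3] and Out[4] above (the first entry is truncated in Out[3], so its full text is taken from Out[4]).

```python
def clean_text_nodes(text_nodes):
    """Strip each text node and drop the ones that were whitespace only."""
    stripped = (node.strip() for node in text_nodes)
    return [node for node in stripped if node]


# Text nodes as returned by
# response.xpath('(//div[@class="content"]/div/blockquote)[1]//text()').extract()
# in the session above.
text_nodes = [
    "\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. ",
    "\r\n",
    "\r\nCode is CAG5014",
    "\r\n",
    "\r\nDeal is on! ",
    "\r\n",
    "\r\n",
    "Don't Forget to tip driver!!",
    "\r\n",
    "\r\n",
    "\r\n\t\t\t",
]

lines = clean_text_nodes(text_nodes)
print("\n".join(lines))
```

This keeps one entry per piece of text, so the result can be stored as a list or joined with whatever separator suits the item field.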