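XPath in Python doesn't grab the entire HTML block

Time: 2014-07-16 15:28:15

Tags: python xpath scrapy

I am using Scrapy to scrape information from a website. My XPath works, but it doesn't return all of the content in the block.

Python code: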

sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0]

I'm using it to grab the first blockquote on the page. The result is cut off after the first <br>.

For example:

If I have this:

<blockquote class="postcontent restore ">
4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
<br>
Operating System
<br>
Windows 8.1 64
<br>
Display
</blockquote>

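It will only return:

  4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)

But I want it to return everything, including the HTML tags and the rest of the text inside the blockquote.

1 Answer:

Answer 0: (score: 1)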

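//div[@class="content"]/div/blockquote/node() will give you all the nodes directly under a blockquote: its child text nodes and element nodes.

In your case, you will get text nodes and <br> elements.

sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0] will extract only the first of those nodes, i.e. the text node containing "4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)".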
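If the goal is the full inner HTML of the blockquote, one option is to keep every extracted node instead of only the first and join them back together. The sketch below is plain Python, so it runs without Scrapy: the `nodes` list is a copy of what `.extract()` returns for the sample blockquote, and `inner_html` is a hypothetical helper name, not part of Scrapy.

```python
# Node strings as returned by .extract() for the sample blockquote.
# In a spider they would come from something like:
#   sel.xpath('(//div[@class="content"]/div/blockquote)[1]/node()').extract()
nodes = [
    '\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
    '<br>',
    '\nOperating System\n',
    '<br>',
    '\nWindows 8.1 64\n',
    '<br>',
    '\nDisplay\n',
]


def inner_html(node_strings):
    """Hypothetical helper: rebuild inner HTML from extracted node strings."""
    return ''.join(node_strings)


# What extract()[0] gives: only the first text node.
first_only = nodes[0]

# All text nodes and <br> tags, in document order.
everything = inner_html(nodes)
```

Note that joining only makes sense when the XPath matches a single blockquote; if the expression matches several, the nodes of all of them end up in one list.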

Here is a sample ipython session showing the different outputs you get with different selectors:

$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import scrapy

In [2]: selector = scrapy.selector.Selector(text="""<blockquote class="postcontent restore ">
   ...: 4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
   ...: <br>
   ...: Operating System
   ...: <br>
   ...: Windows 8.1 64
   ...: <br>
   ...: Display
   ...: </blockquote>""")

In [3]: selector.xpath('blockquote/node()').extract()
Out[3]: []

In [4]: selector.xpath('.//blockquote/node()').extract()
Out[4]: 
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
 u'<br>',
 u'\nOperating System\n',
 u'<br>',
 u'\nWindows 8.1 64\n',
 u'<br>',
 u'\nDisplay\n']

In [5]: selector.xpath('.//blockquote').extract()
Out[5]: [u'<blockquote class="postcontent restore ">\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n<br>\nOperating System\n<br>\nWindows 8.1 64\n<br>\nDisplay\n</blockquote>']

In [6]: selector.xpath('string(.//blockquote)').extract()
Out[6]: [u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\nOperating System\n\nWindows 8.1 64\n\nDisplay\n']

In [7]: selector.xpath('.//blockquote//text()').extract()
Out[7]: 
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
 u'\nOperating System\n',
 u'\nWindows 8.1 64\n',
 u'\nDisplay\n']

In [8]: "\n".join(selector.xpath('.//blockquote//text()').extract())
Out[8]: u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\n\nOperating System\n\n\nWindows 8.1 64\n\n\nDisplay\n'

In [9]: 

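Following the OP's comments, what fits is (//div[@class="content"]/div/blockquote)[1]//text(). The parentheses make [1] apply to the whole result set, so it selects the first matching blockquote on the page rather than the first blockquote under each parent div.

Using the OP's original input page: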

$ scrapy shell http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/
2014-07-16 20:43:45+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-07-16 20:43:45+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-07-16 20:43:45+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-07-16 20:43:45+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled item pipelines: 
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-07-16 20:43:46+0200 [default] INFO: Spider opened
2014-07-16 20:43:47+0200 [default] DEBUG: Crawled (200) <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f63775b0c10>
[s]   item       {}
[s]   request    <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s]   response   <200 http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s]   settings   <scrapy.settings.Settings object at 0x7f6377c4fd90>
[s]   spider     <Spider 'default' at 0x7f6376d52bd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: response.xpath('//div[@class="content"]/div/blockquote')
Out[1]: 
[<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>]

In [2]: response.xpath('(//div[@class="content"]/div/blockquote)[1]')
Out[2]: [<Selector xpath='(//div[@class="content"]/div/blockquote)[1]' data=u'<blockquote class="postcontent restore "'>]

In [3]: response.xpath('(//div[@class="content"]/div/blockquote)[1]//text()')
Out[3]: 
[<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t\tGot a coupon that stated 50% off a'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nCode is CAG5014'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nDeal is on! '>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u"Don't Forget to tip driver!!">,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t'>]

In [4]: response.xpath('string((//div[@class="content"]/div/blockquote)[1])').extract()
Out[4]: [u"\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. \r\n\r\nCode is CAG5014\r\n\r\nDeal is on! \r\n\r\nDon't Forget to tip driver!!\r\n\r\n\r\n\t\t\t"]

In [5]: response.xpath('normalize-space((//div[@class="content"]/div/blockquote)[1])').extract()
Out[5]: [u"Got a coupon that stated 50% off any pizza at menu price. Code is CAG5014 Deal is on! Don't Forget to tip driver!!"]

In [6]:
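If the whitespace cleanup has to happen in Python rather than in the XPath expression (for example on strings that were already extracted), something along these lines should give the same result as normalize-space(). This is a minimal sketch: `normalize_space` is a hypothetical helper, and it deliberately treats only space, tab, CR and LF as whitespace, which is how XPath 1.0 defines it.

```python
import re


def normalize_space(text):
    """Hypothetical helper mimicking XPath 1.0 normalize-space().

    Collapses runs of space, tab, CR and LF into a single space and
    removes leading and trailing whitespace.
    """
    return re.sub(r'[ \t\r\n]+', ' ', text).strip(' ')


# String value of the first blockquote, copied from the string(...) output above.
raw = ("\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. "
       "\r\n\r\nCode is CAG5014\r\n\r\nDeal is on! \r\n\r\n"
       "Don't Forget to tip driver!!\r\n\r\n\r\n\t\t\t")

clean = normalize_space(raw)
```

A shorter `' '.join(raw.split())` behaves the same on this input, but `str.split()` also splits on other Unicode whitespace such as non-breaking spaces, so the two can differ on other pages.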