Unable to scrape text inside a p tag/element using Python Scrapy

Time: 2013-10-15 10:25:14

Tags: python scrapy

I want to extract the product name from the page
http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html using the XPath //*[@id="product_addtocart_form"]/div[7]/div/div[1]/h1/p.

I have tried the following, but it returns nothing:

item['pname'] = ' '.join(hxs.select('//*[@id="product_addtocart_form"]/div[7]/div/div[1]/h1/p/text()').extract()).strip()

1 Answer:

Answer 0 (score: 0)

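There must be some kind of h1 parsing issue in lxml, because

//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()

does contain the text node you want, while

//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()

does not, even though the p you want sits inside the h1 element in the page source. A possible explanation is that a p element is not valid content for an h1 in HTML, so the parser may be closing the h1 before it reaches the p.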

The HTML source for that area of the page is:

<div class="product-shop detail-right">
    <div class="prcdt-overview">
        <div class="title">
                                    <h1>
                <div class="htag">Vincent Chase</div>
                <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
            </h1>
            <span style="text-align:center;color:#329C92;font-size:12px;padding-top:5px">Product Id: 73871</span>
        </div>               

        <div id="container2" style="display: none;">
            <div class="product-options" id="product-options-wrapper">

Take a look at this scrapy shell session:

paul@wheezy:~$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 13:16:33+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 13:16:34+0200 [default] INFO: Spider opened
2013-10-15 13:16:35+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s]   item       {}
[s]   request    <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   response   <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x354c310>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()').extract()
Out[1]: 
[u'\n                            ',
 u'Vincent Chase',
 u'\n                            ']

In [2]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()').extract()
Out[2]: 
[u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses',
 u'Enter the details below as they appear on your prescription from your doctor. ',
 u'Understand Your Prescription.',
 u'Retail Store Price - Rs 1600',
 u'You Save - Rs 800',
 u'Retail Store Price - Rs 4500',
 u'You Save - Rs 1010',
 u'STATUS: ',
 u'READY TO SHIP\t',
 u'(LIMITED STOCK)',
 u'    ',
 u'Delivered By 20 Oct,2013']

In [4]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//div[@class="htag"]//text()').extract()
Out[4]: [u'Vincent Chase']

In [5]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//p//text()').extract()
Out[5]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']

In [6]: 
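
The text nodes in Out[1] and Out[2] above carry leading spaces, tabs and whitespace-only entries. Below is a minimal sketch, in plain Python 3 and without Scrapy, of one way to tidy such a list before storing it in an item. The helper name clean_text_nodes is hypothetical and not part of Scrapy; the sample values are copied from Out[2].

```python
def clean_text_nodes(nodes):
    """Collapse runs of whitespace in each node and drop nodes left empty."""
    cleaned = (" ".join(node.split()) for node in nodes)
    return [text for text in cleaned if text]


# A few of the values shown in Out[2] of the session above.
raw_nodes = [
    " Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses",
    "STATUS: ",
    "READY TO SHIP\t",
    "(LIMITED STOCK)",
    "    ",
    "Delivered By 20 Oct,2013",
]

cleaned_nodes = clean_text_nodes(raw_nodes)
print(cleaned_nodes)
```

Compared with calling .strip() on the joined string, cleaning each node first also removes the whitespace-only entries and trailing tabs in the middle of the list.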

Suggestion:

This site/page uses the "itemscope" and "itemtype" attributes (see http://schema.org/docs/gs.html#microdata_itemscope_itemtype), so I suggest using them to extract the data you need.

For example, you could use this XPath expression:

//*[@itemscope and @itemtype="http://schema.org/Product"]
    //*[@itemprop="name"]/text()

With HtmlXPathSelector, you can use:

In [1]: ''.join(hxs.select('//*[@itemscope and @itemtype="http://schema.org/Product"]//*[@itemprop="name"]/text()').extract()).strip()
Out[1]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'

Example scrapy shell session:

paul@wheezy:~$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 12:47:30+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 12:47:31+0200 [default] INFO: Spider opened
2013-10-15 12:47:32+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s]   item       {}
[s]   request    <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   response   <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x3f54310>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: hxs.select("""
   ...: //*[@itemscope and @itemtype="http://schema.org/Product"]
   ...:     //*[@itemprop="name"]/text()""")
Out[1]: [<HtmlXPathSelector xpath='\n//*[@itemscope and @itemtype="http://schema.org/Product"]\n    //*[@itemprop="name"]/text()' data=u' Colorato VC 5134 Matt Black Grey Gradie'>]

In [2]: hxs.select("""
   ...: //*[@itemscope and @itemtype="http://schema.org/Product"]
   ...:     //*[@itemprop="name"]/text()""").extract()
Out[2]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']

In [3]: ''.join(hxs.select("""
   ...: //*[@itemscope and @itemtype="http://schema.org/Product"]
   ...:     //*[@itemprop="name"]/text()""").extract()).strip()
Out[3]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'

In [4]:
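
The itemprop lookup can also be sketched without Scrapy or lxml. The example below is an illustration only, not the approach used in the session above. It uses Python 3's standard-library html.parser, which reports tags as events and builds no tree, so the result does not depend on how a tree-building parser nests the p relative to the h1. The names ItemPropText, extract_itemprop and SNIPPET are hypothetical. The markup is the page fragment quoted earlier, with the closing div tags added. Unlike the XPath expression, this sketch does not check for an enclosing itemscope/itemtype element; it simply returns the text of the first element whose itemprop matches.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ItemPropText(HTMLParser):
    """Collect the text of the first element whose itemprop equals `prop`."""

    def __init__(self, prop):
        super().__init__()
        self.prop = prop
        self.depth = 0       # > 0 while inside the matching element
        self.chunks = []     # text pieces found inside it
        self.done = False    # True once the matching element has closed

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("itemprop") == self.prop:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_itemprop(html, prop):
    """Return the stripped text of the first element with itemprop=prop."""
    parser = ItemPropText(prop)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Fragment of the page source quoted in the answer (closing tags added).
SNIPPET = """
<div class="product-shop detail-right">
    <div class="prcdt-overview">
        <div class="title">
            <h1>
                <div class="htag">Vincent Chase</div>
                <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
            </h1>
            <span>Product Id: 73871</span>
        </div>
    </div>
</div>
"""

product_name = extract_itemprop(SNIPPET, "name")
print(product_name)
```

Inside a spider the XPath expression shown earlier remains the more direct option; this variant is mainly useful for checking what the raw markup contains, independently of the tree that lxml builds.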