Using Scrapy and XPath to select the text content of a specific <div> that contains another <div>

Time: 2014-10-09 05:49:27

Tags: python html xpath web-scraping scrapy

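Edit: Solved! For anyone who runs into this while learning: the answer is below, clearly explained and provided by Paul.

This is my first question here. I have searched for two days so far to no avail. I want to scrape a particular retail website to get product names and prices.

I currently have a spider that works on one retail website. On another retail website, however, it only partly works: I get the product names correctly, but I cannot get the prices in the right format.

First, here is my current spider code: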

import scrapy

from projectname.items import projectItem

class spider_whatever(scrapy.Spider):
    name = "whatever"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="container"]')
        product = requests.xpath('.//*[@class="productname"]/text()').extract()
        price = requests.xpath('.//*[@class="price"]').extract() #Issue lies here.

        itemlist = []
        for product, price in zip(product, price):
            item = projectItem()
            item['product'] = product.strip().upper()
            item['price'] = price.strip()
            itemlist.append(item)
        return itemlist

Now, the target HTML for the price is:

<div id="listPrice1" class="price">
                        $622                        <div class="cents">.00</div>
                    </div>

As you can see, not only is it messy, it also has a div inside the div I want to reference. Now, when I try this:

price = requests.xpath('.//*[@class="price"]/text()').extract()

it outputs this:

product,price
some_product1, $100
some_product2, 
some_product3, $200
some_product4, 

when it should output:

product,price
some_product1, $100
some_product2, $200
some_product3, $300
some_product4, $400

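I think what is happening is that it also extracts the div class="cents" and assigns it to the next product, pushing each subsequent value down by one.

When I scrape the data through a Google Docs spreadsheet, it puts the product in one column and splits the price into two columns: the first is the $ amount and the second is the .00 cents, like this: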

product,price,cents
some_product1, $100, .00
some_product2, $200, .00
some_product3, $300, .00
some_product4, $400, .00

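So my question is: how do I separate the div within the div? Is there a special way to exclude it in the XPath, or can I filter it out while parsing the data? If I can filter it out, how do I do that?

Any help is greatly appreciated. Please understand that I am new to Python and doing my best to learn.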

1 answer:

Answer 0: (score: 3)

Let's explore a few different XPath patterns:

>>> import scrapy
>>> selector = scrapy.Selector(text="""<div id="listPrice1" class="price">
...                         $622                        <div class="cents">.00</div>
...                     </div>""")

# /text() selects all text nodes that are direct children of the context node,
# here any element with class "price";
# there are 2 of them
>>> selector.xpath('.//*[@class="price"]/text()').extract()
[u'\n                        $622                        ', u'\n                    ']

# if you wrap the context node inside the "string()" function,
# you'll get the string representation of the node,
# basically a concatenation of text elements
>>> selector.xpath('string(.//*[@class="price"])').extract()
[u'\n                        $622                        .00\n                    ']

# using "normalize-space()" instead of "string()"
# also strips leading and trailing whitespace and
# collapses each internal run of whitespace into 1 space character
>>> selector.xpath('normalize-space(.//*[@class="price"])').extract()
[u'$622 .00']

# you could also ask for the 1st text node under the element with class "price"
>>> selector.xpath('.//*[@class="price"]/text()[1]').extract()
[u'\n                        $622                        ']

# space-normalized version of that may do what you want
>>> selector.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract()
[u'$622']
>>> 
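
The two text nodes per price element seen above are a plausible explanation for the shifted price column in the question: zip() pairs products with text nodes by position, so every second product receives a whitespace-only node. Below is a minimal sketch of that effect in plain Python. The sample lists and both function names are hypothetical, and `nodes_per_price=2` is an assumption that holds only for HTML shaped like the snippet above.

```python
# Hypothetical sample data: each price <div> contributes two text nodes
# to a /text() query (the amount, then the whitespace after the inner div).
products = ["some_product1", "some_product2", "some_product3", "some_product4"]
price_text_nodes = [
    "\n    $100    ", "\n    ",
    "\n    $200    ", "\n    ",
    "\n    $300    ", "\n    ",
    "\n    $400    ", "\n    ",
]


def pair_naively(names, text_nodes):
    """Pair names with text nodes purely by position, as zip() does."""
    return [(name, text.strip()) for name, text in zip(names, text_nodes)]


def pair_first_text_node(names, text_nodes, nodes_per_price=2):
    """Keep only the first text node of each price element, then pair."""
    first_nodes = text_nodes[::nodes_per_price]
    return [(name, text.strip()) for name, text in zip(names, first_nodes)]


if __name__ == "__main__":
    for row in pair_naively(products, price_text_nodes):
        print(row)
    for row in pair_first_text_node(products, price_text_nodes):
        print(row)
```

Slicing by a fixed step is fragile if some price elements have a different number of text nodes, which is one reason to query each container separately instead of zipping two flat lists.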

So, in the end, you might follow this pattern:

def parse(self, response):
    sel = scrapy.Selector(response)
    requests = sel.xpath('//div[@class="container"]')
    itemlist = []
    for r in requests:
        item = projectItem()
        item['product'] = r.xpath('normalize-space(.//*[@class="productname"])').extract()
        item['price'] = r.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract()
        itemlist.append(item)
    return itemlist
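
The question also asks whether the cents could be filtered while parsing the data rather than in the XPath. A minimal sketch of that alternative follows, assuming the price text has the shape shown earlier (for example '$622 .00' from normalize-space() on the whole element). The name `parse_price` is hypothetical and not part of Scrapy; it assumes "." is the decimal separator and that any commas are thousands separators.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert text such as '$622 .00' or '$622' to a Decimal.

    Returns None when no valid number is left after cleaning.
    """
    # Drop everything except digits and the decimal point
    # (currency symbol, spaces, newlines, thousands separators).
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Raised for an empty string or malformed input such as '1.2.3'.
        return None


if __name__ == "__main__":
    print(parse_price("$622 .00"))
    print(parse_price("\n    $622    "))
```

Since .extract() returns a list, a single string (for instance the first element of that list) would be passed to the helper. Decimal is used instead of float so that amounts such as 622.00 keep their exact value.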