Question

I'm creating a Scrapy project in which I scrape (obviously!) specific data from a web page.

items = sel.xpath('//div[@class="productTiles cf"]/ul').extract()
for item in items:
    price = sel.xpath('//ul/li[@class="productPrice"]/span/span[@class="salePrice"]').extract()
    print price

This produces the following result:

u'<span class="salePrice">$20.43\xa0<span class="reducedFrom">$40.95</span></span>',
u'<span class="salePrice">$20.93\xa0<span class="reducedFrom">$40.95</span></span>'

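All I need is the salePrice value, e.g. 20.43 and 20.93 respectively, ignoring the other markup and the rest of the data. Any help here would be greatly appreciated.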

Answer 1

It looks like the solution is the following:

//ul/li[@class="productPrice"]/span/span[@class="salePrice"]//text()

It grabs the text of the elements I was looking for, like this:

u'$20.43\xa0', u'$20.93\xa0'

Now I can parse it to strip the unneeded junk at the end, and I'm all set. If anyone has a more elegant solution, I'd love to see it.
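One possible way to do that final cleanup is sketched below. It pulls the first decimal number out of each raw string, which drops both the leading `$` and the trailing `\xa0` (a non-breaking space). The helper name `parse_price` and the regular expression are assumptions made for illustration, not part of Scrapy, and the pattern does not handle thousands separators such as `1,234.56`.

```python
import re
from decimal import Decimal

# Hypothetical helper, not part of Scrapy: matches the first plain
# decimal number (e.g. "20.43") in a string.
PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(raw):
    """Return the first number in a raw price string as a Decimal, or None."""
    match = PRICE_RE.search(raw)
    return Decimal(match.group()) if match else None


# The raw strings reported in the answer above.
raw_prices = [u"$20.43\xa0", u"$20.93\xa0"]
prices = [parse_price(p) for p in raw_prices]
print(prices)
```

Scrapy selectors also offer a `.re()` method that applies a regular expression to the selected nodes, which may let you do the extraction and the cleanup in one step.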

Answer 2

span[@class="salePrice"] returns the span together with its children.

This should contain only the text of the top-level span:

sel.xpath('//ul/li[@class="productPrice"]/span/span[@class="salePrice"]/text()').extract()[0]
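To see the difference between the span's own text and the text of all its descendants without running a spider, here is a small stand-alone sketch. It uses Python's standard-library `xml.etree.ElementTree` rather than Scrapy, so it is only an analogy: `.text` plays roughly the role of `/text()` for this markup, and `itertext()` plays the role of `//text()`. The markup is the sample from the question.

```python
import xml.etree.ElementTree as ET

# Sample markup taken from the question's output.
markup = u'<span class="salePrice">$20.43\xa0<span class="reducedFrom">$40.95</span></span>'
outer = ET.fromstring(markup)

# Text directly inside the outer span, before its first child element.
own_text = outer.text

# Text of the outer span and of every descendant, in document order.
all_text = list(outer.itertext())

print(repr(own_text))
print(all_text)
```

Note that XPath's `/text()` selects every direct text-node child, including text that follows a child element, whereas ElementTree's `.text` covers only the text before the first child. For the markup here the two coincide.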

Remove child nodes from a selector
