Why doesn't scrapy give any results for an item?

Asked: 2015-08-12 23:09:39

Tags: python xpath web-crawler scrapy

I want to get the price and the seller name with scrapy, but I can't work out the correct xpath to iterate over them. How do I get the right xpath so that I can retrieve the seller and all of the prices?

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from scrapy.contrib.linkextractors import LinkExtractor



class mspItem(scrapy.Item):
    model_name = scrapy.Field()

    price  = scrapy.Field()
    seller = scrapy.Field()


class criticspider(CrawlSpider):
    name = "msp_specs"
    allowed_domains = ["mysmartprice.com/"]
    #### Give array of URLS here, it will generate specs.json, run clean.py on it, mentioning words to include and remove ####
    start_urls = ["http://www.mysmartprice.com/mobile/microsoft-lumia-535-msp5042"]


    def parse(self, response):
        sites = response.xpath('//div[@id="pricetable"]//div[@class="store_pricetable"]')
        items = []
        item = mspItem()
        item['model_name'] = response.xpath('//h2[contains(@class,"priceindia")]/text()').extract()
        for site in sites:

            #item["seller"] = site.xpath("/@data-storename").extract()[0]
            item['price'] = site.xpath('//div[store_price_out]/text()').extract()



            items.append(item)
        return items

Updated code -

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from scrapy.contrib.linkextractors import LinkExtractor



class mspItem(scrapy.Item):
    model_name = scrapy.Field()

    price  = scrapy.Field()
    seller = scrapy.Field()


class criticspider(CrawlSpider):
    name = "msp_specs"
    allowed_domains = ["mysmartprice.com/"]
    #### Give array of URLS here, it will generate specs.json, run clean.py on it, mentioning words to include and remove ####
    start_urls = ["http://www.mysmartprice.com/mobile/microsoft-lumia-535-msp5042"]




    def parse(self, response):
            sites = response.xpath('//div[contains(@class,"store_pricetable")]')
            items = []
            for site in sites:

                item = mspItem()
                item['model_name'] = response.xpath('//h2[contains(@class,"priceindia")]/text()').extract()
                item['price'] = site.xpath('.//div[@class="store_price"]/text()').extract()



                items.append(item)
            return items

1 Answer:

Answer 0: (score: 0)

I guess your first xpath for sites is wrong: according to the site's source, the divs with the class attribute 'store_pricetable' are not children of the div 'pricetable'.

Also, some of those divs have the class 'store_pricetable featured_seller'.

So you should use contains() here to get all of those divs.

Your xpath for price is also wrong: the way you wrote it - //div[store_price_out] - does not select what you intend (it matches divs that have a child element named store_price_out, searching from the document root).

You should do - item['price'] = site.xpath('.//div[@class="store_price"]/text()').extract() - where the dot at the start makes the xpath relative to the current element.

Also, you should create the item anew on each loop iteration; don't reuse the same item object over and over, as each iteration overwrites the values of the previous one.
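The overwriting pitfall is plain Python behavior, not anything Scrapy-specific: appending the same mutable object several times puts the same reference into the list several times. A minimal sketch with plain dicts standing in for the Item (the values are made up for illustration):

```python
# Reusing one dict: every list entry is the same object,
# so the last assignment shows up in all of them.
items = []
item = {}
for price in ["100", "200", "300"]:
    item["price"] = price
    items.append(item)
print([i["price"] for i in items])  # → ['300', '300', '300']

# Creating a fresh dict per iteration keeps each value separate.
items = []
for price in ["100", "200", "300"]:
    item = {"price": price}
    items.append(item)
print([i["price"] for i in items])  # → ['100', '200', '300']
```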

Example -
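To illustrate why the leading dot matters, here is a sketch using the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors (ElementTree supports only a subset of XPath, and the markup below is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Toy markup mimicking the price table: two seller blocks, each with a price.
html = """
<body>
  <div class="store_pricetable"><div class="store_price">100</div></div>
  <div class="store_pricetable"><div class="store_price">200</div></div>
</body>
"""
root = ET.fromstring(html)
sites = root.findall('.//div[@class="store_pricetable"]')

# Relative path (leading '.'): evaluated from each 'site' element,
# so every block yields only its own price. In Scrapy, a path starting
# with '//' and no dot would instead search the whole document from the
# root and return every price for every site in the loop.
prices = [site.find('./div[@class="store_price"]').text for site in sites]
print(prices)  # → ['100', '200']
```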