Scrapy neither shows any errors nor extracts any data

Time: 2017-04-18 22:53:30

Tags: python web-scraping scrapy

I'm trying to parse product names and prices from a website using Scrapy. However, when I run my Scrapy code, it neither shows any errors nor extracts any data. I can't figure out what I'm doing wrong. I hope someone can take a look at it.

The "items.py" file includes:

import scrapy
class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
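
For reference, if the scraped fields were meant to go through this item class (the spider below yields plain dicts instead), a minimal, hypothetical sketch of populating it in a callback could look like the following. The project package name "myproject" and the spider name are placeholders, not part of the original code:

import scrapy
from myproject.items import SephoraItem  # "myproject" is a placeholder for your project package

class ItemExampleSpider(scrapy.Spider):
    name = "sephora_items_example"  # hypothetical spider, for illustration only
    start_urls = ["https://www.sephora.ae/en/makeup/eye/eyeshadow"]

    def parse(self, response):
        # same XPaths as in the question, but yielding the Item instead of a dict
        for product in response.xpath('//div[@class="product-info"]'):
            item = SephoraItem()
            item['Name'] = product.xpath('.//a[@title]/text()').extract()
            item['Price'] = product.xpath('.//span[@class="price"]/text()').extract()
            yield item
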
The spider file named "sephorasp.py" contains:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor

class SephoraspSpider(CrawlSpider):
    name = "sephorasp"
    allowed_domains = ['sephora.ae']
    start_urls = ["https://www.sephora.ae/en/stores/"]
    rules = [
            Rule(LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd  parent"]')),
            Rule(LinkExtractor(restrict_xpaths='//li[@class="level2 nav-1-1-1 active first"]'),
            callback="parse_item")
    ]

    def parse_item(self, response):
        page = response.xpath('//div[@class="product-info"]')
        for titles in page:
            Product = titles.xpath('.//a[@title]/text()').extract()
            Rate = titles.xpath('.//span[@class="price"]/text()').extract()
            yield {'Name':Product,'Price':Rate}
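
A quick way to see why nothing gets extracted is to test the rule XPaths in scrapy shell. A sketch of such a session follows; the actual results depend on the live page, and the comments only describe what an empty result would mean:

scrapy shell "https://www.sephora.ae/en/stores/"
# then, at the shell prompt:
response.xpath('//li[@class="level0 nav-1 active first touch-dd  parent"]')
# an empty list means the first Rule never finds any links, so parse_item is never reached
response.xpath('//li[contains(@class, "level0")]')
# a non-empty list here would suggest the class exists, just not with that exact value and order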

Here is a link to the log: https://www.dropbox.com/s/8xktgh7lvj4uhbh/output.log?dl=0

It works when I play around with BaseSpider:

from scrapy.spider import BaseSpider
from scrapy.http.request import Request

class SephoraspSpider(BaseSpider):
    name = "sephorasp"
    allowed_domains = ['sephora.ae']
    start_urls = [
                    "https://www.sephora.ae/en/travel-size/make-up",
                    "https://www.sephora.ae/en/perfume/women-perfume",
                    "https://www.sephora.ae/en/makeup/eye/eyeshadow",
                    "https://www.sephora.ae/en/skincare/moisturizers",
                    "https://www.sephora.ae/en/gifts/palettes"

    ]

    def pro(self, response):
        item_links = response.xpath('//a[contains(@class,"level0")]/@href').extract()
        for a in item_links:
            yield Request(a, callback = self.end)

    def end(self, response):
        item_link = response.xpath('//a[@class="level2"]/@href').extract()
        for b in item_link:
            yield Request(b, callback = self.parse)

    def parse(self, response):
        page = response.xpath('//div[@class="product-info"]')
        for titles in page:
            Product= titles.xpath('.//a[@title]/text()').extract()
            Rate= titles.xpath('.//span[@class="price"]/text()').extract()
            yield {'Name':Product,'Price':Rate}
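
Since the XPaths inside parse work once the category pages are reached directly, the link-following rules are the likely suspect. If needed, the link extractors themselves can also be checked in scrapy shell, roughly like this (results depend on the live page):

# inside scrapy shell, after fetching https://www.sephora.ae/en/stores/
from scrapy.linkextractors import LinkExtractor
LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd  parent"]').extract_links(response)
# an empty result means the CrawlSpider has nothing to follow from the start page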

1 Answer:

Answer 0 (score: 1):

Your XPaths are seriously flawed.

Rule(LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd  parent"]')),
Rule(LinkExtractor(restrict_xpaths='//li[@class="level2 nav-1-1-1 active first"]'),

You are matching against the whole set of classes, which can change at any time, and the order might be different in Scrapy. Just pick one class; it's most likely unique enough:

Rule(LinkExtractor(restrict_xpaths='//li[contains(@class,"level0")]')),
Rule(LinkExtractor(restrict_xpaths='//li[contains(@class,"level2")]')),