I'm trying to parse product names and prices from a website using Scrapy. However, when I run my Scrapy code it neither shows any error nor extracts any data. I can't figure out what I'm doing wrong and hope someone can take a look.
My "items.py" contains:
import scrapy

class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
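(Note that the spider below yields plain dicts, so this item class is never actually populated. A minimal sketch of using it in the callback instead, assuming the project package is named sephora, which is only a guess:)

# At the top of the spider module -- the package name "sephora" is an assumption:
from sephora.items import SephoraItem

# ...and inside the spider class the callback would yield items instead of dicts:
def parse_item(self, response):
    for product in response.xpath('//div[@class="product-info"]'):
        item = SephoraItem()
        item['Name'] = product.xpath('.//a[@title]/text()').extract()
        item['Price'] = product.xpath('.//span[@class="price"]/text()').extract()
        yield item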
The spider file named "sephorasp.py" contains:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SephoraspSpider(CrawlSpider):
    name = "sephorasp"
    allowed_domains = ['sephora.ae']
    start_urls = ["https://www.sephora.ae/en/stores/"]

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd parent"]')),
        Rule(LinkExtractor(restrict_xpaths='//li[@class="level2 nav-1-1-1 active first"]'),
             callback="parse_item")
    ]

    def parse_item(self, response):
        page = response.xpath('//div[@class="product-info"]')
        for titles in page:
            Product = titles.xpath('.//a[@title]/text()').extract()
            Rate = titles.xpath('.//span[@class="price"]/text()').extract()
            yield {'Name': Product, 'Price': Rate}
Here is a link to the log: https://www.dropbox.com/s/8xktgh7lvj4uhbh/output.log?dl=0
It works when I play around with BaseSpider:
from scrapy.spider import BaseSpider
from scrapy.http.request import Request

class SephoraspSpider(BaseSpider):
    name = "sephorasp"
    allowed_domains = ['sephora.ae']
    start_urls = [
        "https://www.sephora.ae/en/travel-size/make-up",
        "https://www.sephora.ae/en/perfume/women-perfume",
        "https://www.sephora.ae/en/makeup/eye/eyeshadow",
        "https://www.sephora.ae/en/skincare/moisturizers",
        "https://www.sephora.ae/en/gifts/palettes"
    ]

    def pro(self, response):
        item_links = response.xpath('//a[contains(@class,"level0")]/@href').extract()
        for a in item_links:
            yield Request(a, callback=self.end)

    def end(self, response):
        item_link = response.xpath('//a[@class="level2"]/@href').extract()
        for b in item_link:
            yield Request(b, callback=self.parse)

    def parse(self, response):
        page = response.xpath('//div[@class="product-info"]')
        for titles in page:
            Product = titles.xpath('.//a[@title]/text()').extract()
            Rate = titles.xpath('.//span[@class="price"]/text()').extract()
            yield {'Name': Product, 'Price': Rate}
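(For reference, the output can be sanity-checked by running the spider with a standard Scrapy feed export, which writes the yielded Name/Price dicts to a file:)

scrapy crawl sephorasp -o products.json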
Answer 0 (score: 1)
Your XPaths are seriously flawed.
Rule(LinkExtractor(restrict_xpaths='//li[@class="level0 nav-1 active first touch-dd parent"]')),
Rule(LinkExtractor(restrict_xpaths='//li[@class="level2 nav-1-1-1 active first"]'),
You are matching the entire class string, which can change at any time, and the order of the classes Scrapy sees may differ from what your browser shows. Just pick one class; it is most likely unique enough:
Rule(LinkExtractor(restrict_xpaths='//li[contains(@class,"level0")]')),
Rule(LinkExtractor(restrict_xpaths='//li[contains(@class,"level2")]')),