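I want to scrape a JavaScript-rendered list of sizes. It is part of this page:
What I want is to get the sizes that are in stock, returned as a list. How can I do that?
Here is my full code: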
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        #sizes = ??
        yield {
            'name' : name,
            'price' : price,
            'sizes' : sizes
        }
Thanks
Answer 0 (score: 1)
Here is code that extracts the in-stock sizes.
import scrapy


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        # Every <option> of the size drop-down, in stock or not.
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
        for s in sizes:
            # The predicate drops the text of options flagged as out of stock,
            # so those options come back as an empty string.
            size = s.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract_first('').strip()
            if size:  # skip the empty strings left by out-of-stock options
                yield {'Size': size}
The result looks like this:
M 4 / W 5.5
M 4.5 / W 6
M 6.5 / W 8
M 7 / W 8.5
M 7.5 / W 9
M 8 / W 9.5
M 8.5 / W 10
M 9 / W 10.5
Inside the for loop, writing it like this would extract every size, whether or not it is in stock:
size = s.xpath('text()').extract_first('').strip()
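But if you only want the sizes that are in stock, note that the out-of-stock options are marked with the class "exp-pdp-size-not-in-stock selectBox-disabled", so you have to exclude them by adding this predicate: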
[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]
I have tested it on other shoe pages and it works there too.
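To make the filtering rule concrete without hitting the live site, here is a minimal sketch that applies the same idea to a small hand-written fragment. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy selectors, and it checks for the `exp-pdp-size-not-in-stock` class token in Python rather than in XPath. The `SAMPLE` markup, the sizes in it, and the `in_stock_sizes` helper are made up for illustration; the real page is certainly larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the size drop-down described above.
SAMPLE = """
<select class="exp-pdp-size-dropdown">
  <option class="exp-pdp-size-not-in-stock selectBox-disabled"> M 3.5 / W 5 </option>
  <option> M 4 / W 5.5 </option>
  <option class="exp-pdp-size-not-in-stock selectBox-disabled"> M 5 / W 6.5 </option>
  <option> M 4.5 / W 6 </option>
</select>
"""

OUT_OF_STOCK = "exp-pdp-size-not-in-stock"


def in_stock_sizes(markup):
    """Return the text of every <option> that is not flagged as out of stock."""
    root = ET.fromstring(markup.strip())
    sizes = []
    for option in root.iter("option"):
        classes = option.get("class", "").split()
        text = (option.text or "").strip()
        # Keep only non-empty options that lack the out-of-stock class token.
        if text and OUT_OF_STOCK not in classes:
            sizes.append(text)
    return sizes


print(in_stock_sizes(SAMPLE))
```

Matching on a single class token is a little more tolerant than comparing the whole class string, since it still works if the site adds or reorders other classes on the option.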
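Answer 1 (score: 0)
The sizes are loaded through an AJAX call.
So you have to make another request to that AJAX URL to get the sizes.
Here is the complete code. (I have not run it on my side, but I am confident it works.)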
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request
import json


class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        data = {}
        data['name'] = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        data['price'] = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        # The sizes are not in this page's HTML; request them from the AJAX endpoint
        # and pass the fields collected so far along in meta.
        sizes_url = "http://store.nike.com/html-services/templateData/pdpData?action=getPage&path=%2Fus%2Fen_us%2Fpd%2Fmagista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat%2Fpid-11229710%2Fpgid-11918119&productId=11229710&productGroupId=11918119&catalogId=100701&cache=true&country=US&lang_locale=en_US"
        yield Request(url=sizes_url, callback=self.parse_sizes, meta={'data': data})

    # Must not be named parse_shoes again: a second definition would silently
    # replace the method above and the callback would never be found.
    def parse_sizes(self, response):
        resp = json.loads(response.text)
        data = response.meta['data']
        sizes = resp['response']['pdpData']['skuContainer']['productSkus']
        sizesArray = []
        for a in sizes:
            sizesArray.append(a["displaySize"])
        yield {
            'name' : data['name'],
            'price' : data['price'],
            'sizes' : sizesArray}
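Since the answer above was not run, it may help to check the JSON handling separately from the spider. The following is a minimal sketch that pulls `displaySize` out of a response body. Only the key path `response` / `pdpData` / `skuContainer` / `productSkus` / `displaySize` comes from the answer; the `extract_display_sizes` helper and the `SAMPLE_BODY` payload are hypothetical and the real response will contain many more fields.

```python
import json


def extract_display_sizes(body):
    """Return the displaySize of every SKU in a pdpData JSON body."""
    payload = json.loads(body)
    # Walk the nested keys with .get so a missing level yields an empty list
    # instead of raising KeyError.
    skus = (
        payload.get("response", {})
        .get("pdpData", {})
        .get("skuContainer", {})
        .get("productSkus", [])
    )
    return [sku["displaySize"] for sku in skus if "displaySize" in sku]


# Hypothetical payload containing only the keys used in the answer.
SAMPLE_BODY = json.dumps({
    "response": {
        "pdpData": {
            "skuContainer": {
                "productSkus": [
                    {"displaySize": "4"},
                    {"displaySize": "4.5"},
                ]
            }
        }
    }
})

print(extract_display_sizes(SAMPLE_BODY))
```

Inside the spider, the same helper could be called as `extract_display_sizes(response.text)`. Note that this collects every size listed in the payload; it does not by itself tell you which ones are in stock.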
Note:
The sizes_url is different for every product, so you will need to spend some time working out which parameters it requires.
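As a starting point for that, here is a minimal sketch that rebuilds the sizes_url from a product page URL rather than hard-coding it. It assumes that the product ID and product group ID can be read from the `pid-` and `pgid-` segments of the path, and that `catalogId`, `country` and `lang_locale` keep the values seen in the example URL. Both are guesses based on this single example and should be checked against other products. `build_sizes_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlencode, urlparse

# Endpoint taken from the example sizes_url above.
PDP_DATA_ENDPOINT = "http://store.nike.com/html-services/templateData/pdpData"


def build_sizes_url(product_url, catalog_id="100701", country="US", lang_locale="en_US"):
    """Build the AJAX URL for a product page URL (defaults are assumptions)."""
    path = urlparse(product_url).path
    # Assumption: the IDs appear in the path as .../pid-<digits>/pgid-<digits>
    match = re.search(r"/pid-(\d+)/pgid-(\d+)", path)
    if match is None:
        raise ValueError("no pid/pgid segments found in %r" % product_url)
    product_id, product_group_id = match.groups()
    params = [
        ("action", "getPage"),
        ("path", path),
        ("productId", product_id),
        ("productGroupId", product_group_id),
        ("catalogId", catalog_id),
        ("cache", "true"),
        ("country", country),
        ("lang_locale", lang_locale),
    ]
    # urlencode percent-encodes the slashes in the path value as %2F.
    return PDP_DATA_ENDPOINT + "?" + urlencode(params)


PRODUCT_URL = "http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119"
print(build_sizes_url(PRODUCT_URL))
```

In the spider, `build_sizes_url(response.url)` could then replace the hard-coded string in parse_shoes, provided the assumptions hold for the products you crawl.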