(Python) Scrapy - How do I scrape a JS dropdown list?

Asked: 2017-03-04 08:59:40

Tags: javascript python scrapy

I want to scrape the JavaScript 'size' dropdown list that is part of this page:

http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119

What I want is to get the sizes that are in stock, returned as a list. How can I do this?

Here is my full code:

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):       
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes) 

    def parse_shoes(self, response):
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        #sizes = ??

        yield {
            'name' : name,
            'price' : price,
            'sizes' : sizes
        }

Thanks.

2 answers:

Answer 0: (score: 1)

Here is code that extracts the in-stock sizes.

import scrapy


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')


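        for s in sizes:
            # The predicate drops the text of options marked as out of stock.
            size = s.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract_first('').strip()
            # Out-of-stock options come back as an empty string, so skip them.
            if size:
                yield {'Size': size}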


The result looks like this:

M 4 / W 5.5
M 4.5 / W 6
M 6.5 / W 8
M 7 / W 8.5
M 7.5 / W 9
M 8 / W 9.5
M 8.5 / W 10
M 9 / W 10.5

Inside the for loop, writing it like this extracts every size, whether or not it is in stock:

size = s.xpath('text()').extract_first('').strip()


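But if you only want the sizes that are in stock, note that out-of-stock sizes are marked with the class "exp-pdp-size-not-in-stock selectBox-disabled", so you have to exclude them by adding this predicate: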

[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]



I have tested it on other shoe pages and it works there too.
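The filtering rule can also be checked offline, without Scrapy or a network request. The following is only a sketch that swaps in the standard library's html.parser for Scrapy's XPath selectors; the SizeOptionParser class and the sample markup are made up for illustration and only mimic the shape of the dropdown. It matches the out-of-stock class as a single token instead of comparing the whole class string, which is an assumption about how the page marks unavailable sizes.

```python
from html.parser import HTMLParser

# Class name taken from the answer above; matched as one token.
OUT_OF_STOCK_CLASS = "exp-pdp-size-not-in-stock"


class SizeOptionParser(HTMLParser):
    """Collect the text of <option> elements not marked as out of stock."""

    def __init__(self):
        super().__init__()
        self.in_stock_sizes = []
        self._capture = False  # True while inside an in-stock <option>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            classes = (dict(attrs).get("class") or "").split()
            self._capture = OUT_OF_STOCK_CLASS not in classes
            self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._capture:
            text = "".join(self._buffer).strip()
            if text:
                self.in_stock_sizes.append(text)
            self._capture = False


# Hypothetical markup, not copied from the real page.
SAMPLE_HTML = """
<select class="exp-pdp-size-dropdown">
  <option class="exp-pdp-size-not-in-stock selectBox-disabled">M 3.5 / W 5</option>
  <option>M 4 / W 5.5</option>
  <option>M 4.5 / W 6</option>
</select>
"""

parser = SizeOptionParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.in_stock_sizes)
```

Matching on a single class token keeps working if the site reorders or adds classes on the option element, whereas the exact string comparison in the XPath above would stop matching.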

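Answer 1: (score: 0)

The sizes are loaded through an AJAX call.

So you have to make another request to that AJAX URL to get the sizes.

Here is the complete code. (I have not run it on my side, but I am confident it works.)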

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request
import json

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pd/magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/pid-11229710/pgid-11918119']

    def parse(self, response):       
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes) 

    def parse_shoes(self, response):
        data = {}
        data['name'] = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        data['price'] = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        #sizes = ??


        sizes_url = "http://store.nike.com/html-services/templateData/pdpData?action=getPage&path=%2Fus%2Fen_us%2Fpd%2Fmagista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat%2Fpid-11229710%2Fpgid-11918119&productId=11229710&productGroupId=11918119&catalogId=100701&cache=true&country=US&lang_locale=en_US"
        yield Request(url = sizes_url, callback=self.parse_sizes, meta={'data':data}) 


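    def parse_sizes(self, response):
        # The AJAX endpoint returns JSON; decode it from the response text.
        resp = json.loads(response.text)

        # Item data collected in parse_shoes and passed along through meta.
        data = response.meta['data']

        sizes = resp['response']['pdpData']['skuContainer']['productSkus']

        # Collect the display size of every SKU in the response.
        sizes_array = [sku['displaySize'] for sku in sizes]

        yield {
            'name': data['name'],
            'price': data['price'],
            'sizes': sizes_array,
        }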
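The JSON handling in the size callback can be tried separately from the spider. Below is a minimal sketch, assuming the response keeps the nesting used in the code above (response, pdpData, skuContainer, productSkus, displaySize). The extract_display_sizes helper and the sample payload are hypothetical; the real response was not inspected here and likely contains many more fields.

```python
import json


def extract_display_sizes(body):
    """Return the displaySize of each SKU, or [] if the layout is different."""
    node = json.loads(body)
    # Walk down the nesting used in the answer's code, one key at a time.
    for key in ("response", "pdpData", "skuContainer", "productSkus"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    if not isinstance(node, list):
        return []
    return [
        sku["displaySize"]
        for sku in node
        if isinstance(sku, dict) and "displaySize" in sku
    ]


# Hypothetical payload that only mimics the assumed structure.
SAMPLE_BODY = json.dumps({
    "response": {
        "pdpData": {
            "skuContainer": {
                "productSkus": [
                    {"displaySize": "M 4 / W 5.5"},
                    {"displaySize": "M 4.5 / W 6"},
                ]
            }
        }
    }
})

print(extract_display_sizes(SAMPLE_BODY))
```

Returning an empty list instead of raising KeyError is a design choice for this sketch: the spider then still yields the name and price if the endpoint changes its layout.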

Note:

The sizes_url is different for each product, so you will need to spend some time working out which parameters are required.
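As a starting point, the URL could be assembled from the product page URL itself. This is a rough sketch, not a confirmed description of the endpoint: build_sizes_url is a hypothetical helper, it assumes productId and productGroupId always match the pid-… and pgid-… segments of the page path, and the catalogId, cache, country and lang_locale values are simply copied from the URL above and may differ for other products.

```python
import re
from urllib.parse import urlencode, urlsplit

# Endpoint copied from the sizes_url in the answer above.
PDP_DATA_ENDPOINT = "http://store.nike.com/html-services/templateData/pdpData"


def build_sizes_url(product_url, catalog_id="100701", country="US",
                    lang_locale="en_US"):
    """Build a pdpData URL from a product page URL (hypothetical helper).

    The default parameter values are assumptions taken from the single
    example URL in the answer.
    """
    path = urlsplit(product_url).path
    match = re.search(r"/pid-(\d+)/pgid-(\d+)", path)
    if match is None:
        raise ValueError("no pid/pgid found in %r" % product_url)
    product_id, product_group_id = match.groups()
    # urlencode percent-encodes the path, including its slashes.
    query = urlencode([
        ("action", "getPage"),
        ("path", path),
        ("productId", product_id),
        ("productGroupId", product_group_id),
        ("catalogId", catalog_id),
        ("cache", "true"),
        ("country", country),
        ("lang_locale", lang_locale),
    ])
    return PDP_DATA_ENDPOINT + "?" + query


page_url = (
    "http://store.nike.com/us/en_us/pd/"
    "magista-opus-ii-tech-craft-2-mens-firm-ground-soccer-cleat/"
    "pid-11229710/pgid-11918119"
)
print(build_sizes_url(page_url))
```

Inside the spider, parse_shoes could call such a helper with response.url instead of hard-coding the address, provided the assumptions hold for the products being scraped.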