How to scrape data from a website that uses JavaScript for pagination

Time: 2019-04-16 08:15:51

Tags: python web-scraping scrapy splash scrapinghub

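I have a website I need to scrape data from, "https://www.forever21.com/us/shop/catalog/category/f21/sale#pageno=1&pageSize=120&filter=price:0,250&sort=5", but I cannot retrieve all of the data: the listing is paginated, and the pagination is done with JavaScript.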

Any ideas on how I can scrape all of the items? Here is my code:

def parse_2(self, response):
    for product_item_forever in response.css('div.pi_container'):
        item = GpdealsSpiderItem_f21()

        f21_title = product_item_forever.css('p.p_name::text').extract_first()
        f21_regular_price = product_item_forever.css('span.p_old_price::text').extract_first()
        f21_sale_price = product_item_forever.css('span.p_sale.t_pink::text').extract_first()
        f21_photo_url = product_item_forever.css('img::attr(data-original)').extract_first()
        f21_description_url = product_item_forever.css('a.item_slider.product_link::attr(href)').extract_first()

        item['f21_title'] = f21_title 
        item['f21_regular_price'] = f21_regular_price 
        item['f21_sale_price'] = f21_sale_price 
        item['f21_photo_url'] = f21_photo_url 
        item['f21_description_url'] = f21_description_url 

        yield item

Please help. Thanks.

1 answer:

Answer 0 (score: 0)

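One of the first steps in a web scraping project is to look for an API that the website itself uses to fetch its data. Using the API not only saves you the work of parsing HTML, it also saves the provider bandwidth and server load. To find the API, open your browser's developer tools and look for XHR requests in the Network tab. In your case, the site makes a POST request to this URL: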

https://www.forever21.com/eu/shop/Catalog/GetProducts
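Before writing a spider, it can help to see what such a POST request looks like outside Scrapy. The following is only a minimal sketch using the standard library: the function name `build_xhr_request` is hypothetical, the body fields and headers are copied from the spider in this answer, and whether the site accepts a request built this way has not been verified. The request is built but not sent.

```python
import json
import urllib.request

API_URL = 'https://www.forever21.com/eu/shop/Catalog/GetProducts'


def build_xhr_request(page_no, page_size=60):
    """Build (but do not send) the POST request for one catalog page."""
    # Assumption: same body layout as the payload used in the spider.
    payload = {
        'brand': 'f21',
        'category': 'sale',
        'page': {'pageNo': page_no, 'pageSize': page_size},
        'filter': {'price': {'minPrice': 0, 'maxPrice': 250}},
        'sort': {'sortType': '5'},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'X-Requested-With': 'XMLHttpRequest',
            'Content-Type': 'application/json; charset=UTF-8',
        },
        method='POST',
    )


request = build_xhr_request(page_no=1)
print(request.get_method(), request.full_url)
print(request.data.decode('utf-8'))
```

Passing the returned object to `urllib.request.urlopen` would actually send it; comparing the printed body with the one shown in the browser's Network tab is a quick way to check that nothing is missing.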

You can then simulate that XHR request in Scrapy to get the data in JSON format. Here is the code for the spider:

# -*- coding: utf-8 -*-
import copy
import json

import scrapy


class Forever21Spider(scrapy.Spider):
    name = 'forever21'

    url = 'https://www.forever21.com/eu/shop/Catalog/GetProducts'
    # base body of the XHR request; pageNo is filled in per request
    payload = {
        'brand': 'f21',
        'category': 'sale',
        'page': {'pageSize': 60},
        'filter': {
            'price': {'minPrice': 0, 'maxPrice': 250}
        },
        'sort': {'sortType': '5'}
    }
    xhr_headers = {
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json; charset=UTF-8'
    }

    def page_request(self, page_no):
        # deep-copy so the nested 'page' dict of the shared class-level
        # payload is never mutated (dict.copy() is only a shallow copy)
        payload = copy.deepcopy(self.payload)
        payload['page']['pageNo'] = page_no
        return scrapy.Request(
            self.url, method='POST', body=json.dumps(payload),
            headers=self.xhr_headers,
            callback=self.parse, meta={'pageNo': page_no}
        )

    def start_requests(self):
        # scrape the first page
        yield self.page_request(1)

    def parse(self, response):
        # parse the JSON response and extract the data
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = {
                'title': product['DisplayName'],
                'regular_price': product['OriginalPrice'],
                'sale_price': product['ListPrice'],
                'photo_url': 'https://www.forever21.com/images/default_330/%s' % product['ImageFilename'],
                'description_url': product['ProductShareLinkUrl']
            }
            yield item

        # simulate pagination: a full page means there may be another one
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            yield self.page_request(response.meta['pageNo'] + 1)
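
The stop condition used for pagination above (keep requesting while a page comes back full) can be tried on its own, without Scrapy or the network. This is only an illustrative sketch: `iter_products`, `fake_fetch` and `FAKE_CATALOG` are hypothetical names, and the fake catalog stands in for the real API.

```python
def iter_products(fetch_page, page_size):
    """Yield products page by page until a page comes back short.

    fetch_page is a hypothetical callable: page number -> list of products.
    """
    page_no = 1
    while True:
        products = fetch_page(page_no)
        yield from products
        # A page with fewer than page_size items is taken to be the last.
        if len(products) < page_size:
            break
        page_no += 1


# Stand-in for the real API: five fake products served two per page.
FAKE_CATALOG = ['p1', 'p2', 'p3', 'p4', 'p5']


def fake_fetch(page_no, page_size=2):
    start = (page_no - 1) * page_size
    return FAKE_CATALOG[start:start + page_size]


print(list(iter_products(fake_fetch, page_size=2)))
```

One consequence of this rule is that when the total number of products is an exact multiple of the page size, one extra request is made and returns an empty page before the loop stops. If the JSON response happens to include a total count, that could be used instead, but the field name would need to be checked in the actual response.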