Question

I have the following code:

import scrapy
import re


class NamePriceSpider(scrapy.Spider):
    name = 'namePrice'
    start_urls = [
        'https://www.cotodigital3.com.ar/sitios/cdigi/browse/'
    ]

    def parse(self, response):
        all_category_products = response.xpath('//*[@id="products"]')
        for product in all_category_products:
            name = product.xpath('//div[@class="descrip_full"]/text()').extract()
            price = product.xpath(
                '//span[@class="atg_store_productPrice" and not(@style)]'
                '/span[@class="atg_store_newPrice"]/text()'
                ' | //span[@class="price_discount"]/text()'
            ).re(r'\$\d{1,5}(?:[.,]\d{3})*(?:[., ]\d{2})*')

            yield {'name': name,
                   'price': price}

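        # Follow the "Siguiente" (next) link once per page, outside the product loop.
        next_page = response.xpath('//a[@title = "Siguiente"]/@href').extract_first()

        # Check before joining: urljoin on a missing href returns the current URL, which is always truthy.
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)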

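It works well and scrapes product names and prices from multiple pages of a supermarket website. The problem is that when I export everything to a JSON file, I get a separate structure per page, for example {"name": ["a", "b", "c"], "price": ["10", "20", "30"]} for one page and {"name": ["d", "f", "g"], "price": ["40", "50", "60"]} for another. I would like a single structure covering all pages so that it is easier to iterate over: {"name": ["a", "b", "c", "d", "f", "g"], "price": ["10", "20", "30", "40", "50", "60"]}. Is there a way to do this?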
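
To make the target structure concrete, here is a minimal sketch in plain Python (no Scrapy needed) of the merge being asked for. The helper name merge_pages and the sample data are hypothetical and only mirror the example values above; it assumes every page dict maps a key to a list.

```python
from collections import defaultdict


def merge_pages(pages):
    """Merge per-page dicts of lists into one dict of concatenated lists.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    merged = defaultdict(list)
    for page in pages:
        for key, values in page.items():
            # Append this page's values after those collected so far.
            merged[key].extend(values)
    return dict(merged)


# Sample per-page output, mirroring the structures described in the question.
pages = [
    {"name": ["a", "b", "c"], "price": ["10", "20", "30"]},
    {"name": ["d", "f", "g"], "price": ["40", "50", "60"]},
]

merged = merge_pages(pages)
print(merged)
```

Inside the spider, the same accumulation could presumably be done on instance attributes, with the combined dict written out once in the spider's closed(reason) method instead of yielding one item per page. Yielding one item per product is another option that also keeps the exported file uniform.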

Scraping multiple pages into the same structure

0 answers