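Convert the following JSON to CSV

Asked: 2019-05-23 06:03:30

Tags: python json python-3.x csv

I want to convert 3 GB of JSON data to CSV format using Python. The code I wrote converts the data to CSV, but it stores everything in a single cell. I do not want the "related" field (I remove it using re).

JSON format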

{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey & Cast'}
{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', 'related': {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}
{'asin': '0000037214', 'related': {'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', 'B00E1YRI4C']}, 'title': 'Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory', 'price': 6.99, 'salesRank': {'Clothing': 1233557}, 'imUrl': 'http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg', 'brand': 'Big Dreams', 'categories': [['Clothing, Shoes & Jewelry', 'Girls'], ['Clothing, Shoes & Jewelry', 'Novelty, Costumes & More', 'Costumes & Accessories', 'More Accessories', 'Kids & Baby']]}
{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] } 

import json
import re 
with open('metadata.json') as f:
    x = f.read()
    data = re.sub("\'related\': {","",x)
    data = re.sub("]},", "]", data)
    data = re.sub("}\n{", "},\n{", data)
    print("1")
    print(type(data))
    data = json.dumps(data)
    print("2")
    print(type(data))
    data = json.loads("[" + data + "]")
    print("3")
    print(type(data))
    print("done")


import pandas 
pandas.read_json(data.to_csv('metadata.csv')

I want a correctly formatted CSV file containing the fields [asin, title, categories, price, also_viewed, also_bought, brand].

1 answer:

Answer 0 (score: 0)

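OK, if I understand you correctly, you have a JSON file containing an array of objects with the fields you mentioned, and you want to store several of those fields in a CSV file. The JSON data contains a field "related" with the subfields "also_viewed" and "also_bought". You want these two subfields as separate columns in the CSV file, and you do not want a "related" column there. Right? If I have misunderstood you, please try to clarify your question.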

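Python can handle JSON and CSV data with the standard library modules json and csv. However, you will have to write the "glue logic" yourself. I do not recommend running regular expressions over JSON data, because that is very error-prone. In addition, your current code performs several operations on the data and keeps multiple copies of it in memory, so it is likely to be slow and to run out of memory quickly.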
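To illustrate the point about avoiding regular expressions: once a record has been parsed, the "related" field can be dropped and its subfields lifted out with ordinary dict operations. This is only a minimal sketch; the helper name flatten_related is made up for illustration, and the record is a shortened version of one from the question.

```python
import json


def flatten_related(record):
    """Move the wanted subfields of 'related' to the top level.

    flatten_related is a hypothetical helper name, not part of any library.
    """
    # pop() removes the 'related' key if present and returns {} otherwise.
    related = record.pop('related', {})
    record['also_viewed'] = related.get('also_viewed', [])
    record['also_bought'] = related.get('also_bought', [])
    return record


raw = ('{"asin": "0000037214", "price": 6.99, '
       '"related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q"]}}')
print(flatten_related(json.loads(raw)))
```

Because the parser has already handled quoting and nesting, nothing here depends on how the text happened to be laid out, which is where the regex approach tends to break.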

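Below is a working example based on standard Python modules. However, I do not know the exact format of your input data (the JSON data in the question is not valid JSON!), nor the exact format you want for the output lists. I made some assumptions, so you may need to change the code if you expect a different output format. The code does output valid CSV data, though, which can be opened in a spreadsheet application (e.g. LibreOffice Calc).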
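One reason the sample in the question is not valid JSON is that most of its records use single-quoted strings, which is Python literal syntax rather than JSON. If the real file looks like that, a possible workaround is to try json.loads first and fall back to ast.literal_eval. This is a sketch under that assumption; parse_record is a hypothetical helper name.

```python
import ast
import json


def parse_record(line):
    """Parse one record that is either JSON or a Python dict literal.

    parse_record is a hypothetical helper; it assumes one record per string.
    """
    try:
        # Valid JSON requires double-quoted strings.
        return json.loads(line)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single-quoted strings).
        # ast.literal_eval only evaluates literals, never arbitrary code.
        return ast.literal_eval(line)


single_quoted = "{'asin': '0001048791', 'categories': [['Books']]}"
double_quoted = '{"asin": "0000031852", "price": 3.17}'

print(parse_record(single_quoted)['asin'])
print(parse_record(double_quoted)['price'])
```

The fallback is slower than json.loads, so for a 3 GB file it may be worth checking which format dominates before choosing the order of the two attempts.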

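Note that this version still loads the whole JSON file at once. It is much more memory-efficient than your current implementation, but it will still consume a lot of memory if the JSON file is 3 GB. If you have enough memory, that need not be a concern (as long as you are using a 64-bit build of Python). However, you could also implement the converter on top of a streaming JSON parser (I got it running with ijson, but it takes more code and I am not sure whether I should post it as an answer). That would reduce memory consumption considerably. The CSV file is already written row by row, so it does not need much memory.

So, now the code :-)

I assume your JSON data looks like this: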
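As a rough illustration of the memory point, here is a standard-library sketch that streams the input instead of loading it whole. It is not the ijson variant mentioned above. It assumes a different input layout: one valid JSON object per line (as the sample in the question is laid out), not a single JSON array. The names COLUMNS, to_row and convert_lines are made up for this sketch.

```python
import csv
import json

COLUMNS = ['asin', 'title', 'categories', 'price',
           'also_bought', 'also_viewed', 'brand']


def to_row(item):
    """Flatten one parsed record into a list matching COLUMNS."""
    related = item.get('related', {})
    # Each category group becomes "(a, b)"; a missing field gives ''.
    categories = ', '.join(
        '(' + ', '.join(group) + ')' for group in item.get('categories', []))
    return [item.get('asin', ''), item.get('title', ''), categories,
            item.get('price', ''),
            ', '.join(related.get('also_bought', [])),
            ', '.join(related.get('also_viewed', [])),
            item.get('brand', '')]


def convert_lines(json_path, csv_path):
    """Stream records from json_path to csv_path, one line at a time."""
    with open(json_path, encoding='utf-8') as src, \
            open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        writer.writerow(COLUMNS)
        for line in src:          # only one input line is held at a time
            line = line.strip()
            if line:              # skip blank lines
                writer.writerow(to_row(json.loads(line)))
```

Calling convert_lines('metadata.json', 'metadata.csv') would then process the file record by record, so memory use depends on the largest single record rather than on the file size.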

[
{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey & Cast"},
{"asin": "0000143561", "categories": [["Movies & TV", "Movies"]], "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "salesRank": {"Movies & TV": 376041}, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}},
{"asin": "0000037214", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "salesRank": {"Clothing": 1233557}, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]},
{"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }
]

Python solution:

import csv
import json

CSV_COLUMNS = ['asin', 'title', 'categories', 'price',
               'also_bought', 'also_viewed', 'brand']


class DataConverter:

    def __init__(self, filename):
        self.filename = filename
        self.writer = None

    def _get_categories(self, item):
        groups = item.get('categories', [[]])
        formatted_groups = []
        for group in groups:
            formatted_groups.append('(' + ', '.join(group) + ')')
        return ', '.join(formatted_groups)

    def _get_also_bought(self, item):
        related = item.get('related', {})
        also_bought = related.get('also_bought', [])
        return ', '.join(also_bought)

    def _get_also_viewed(self, item):
        related = item.get('related', {})
        also_viewed = related.get('also_viewed', [])
        return ', '.join(also_viewed)

    def _process(self, item):
        """Process one data item.

        Expects a Python representation of a JSON object.
        Returns a list of strings (one row of the CSV file).
        """
        asin = item.get('asin', '')
        title = item.get('title', '')
        categories = self._get_categories(item)
        price = item.get('price', '')
        also_bought = self._get_also_bought(item)
        also_viewed = self._get_also_viewed(item)
        brand = item.get('brand', '')
        return [asin, title, categories, price, also_bought, also_viewed,
                brand]

    def convert(self):
        """Convert the JSON file to a CSV file."""
        # Open the JSON file.
        with open(self.filename + '.json', 'rb') as json_file:
            # Parse the JSON data.
            data = json.load(json_file)
            # Open the CSV file.
            with open(self.filename + '.csv', 'wt', newline='') as csv_file:
                # Create the CSV writer object.
                self.writer = csv.writer(csv_file, delimiter=',')
                # Write the header row.
                self.writer.writerow(CSV_COLUMNS)
                # Write CSV rows, one by one.
                for item in data:
                    row = self._process(item)
                    self.writer.writerow(row)


if __name__ == '__main__':
    DataConverter('metadata').convert()