Question

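I'm trying to load a large JSON file (300MB) so that I can parse it into Excel. I just started running into a MemoryError when I call json.load(file). Similar questions have been posted, but none of them answer my specific question. I want to be able to return all the data from the JSON file in one block, like I do in the code. What is the best way to do that? The code and JSON structure are below: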

The code looks like this.

import json

def parse_from_file(filename):
    """Load the given (already verified) JSON file and return its contents.

    Args:
        filename (string): full branch location, used to grab the json file plus '_metrics.json'
    Returns:
        data: whatever data is loaded from the json file
    """

    print("STARTING PARSE FROM FILE")
    # The with block closes the file on exit, so no explicit close() is needed.
    with open(filename) as json_file:
        return json.load(json_file)

The structure looks like this.

[
    {
        "analysis_type": "test_one",
        "date": 1505900472.25, 
        "_id": "my_id_1.1.1",
        "content": {
            .
            .
            .
        }
    },
    {
        "analysis_type": "test_two",
        "date": 1605939478.91,
        "_id": "my_id_1.1.2",
        "content": {
            .
            .
            .
        }
    },

    .
    .
    .
]

Inside "content" the information is not consistent, but it follows one of 3 distinct templates, which can be predicted from analysis_type.

Answer 1

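If every library you have tested gives you memory problems, my approach would be to split the file into one file per object in the array.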

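If the file has newlines and indentation as shown in the question, I would read it line by line, discarding the lines that contain only [ or ], and write the accumulated lines to a new file every time you reach the }, that closes a record; you also need to remove that trailing comma. Then try to load each file, printing a message after each one finishes, to see where it fails, if it fails at all.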
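The line-by-line split described above could be sketched roughly as follows. This is only an illustration, not tested against the real 300MB file: the names split_records and load_each and the record_NNNNNN.json file naming are made up here, and the code assumes every record opens with a line containing only { and closes with a line containing only } or }, at the same indentation.

```python
import json
import os


def split_records(path, out_dir):
    """Write each top-level object of an indented JSON array to its own file.

    Assumes "[" and "]" sit on their own lines and every record opens and
    closes on its own line at the same indentation (hypothetical layout check).
    """
    paths = []
    lines = []            # lines of the record currently being collected
    record_indent = None  # indentation of the "{" that opens each record
    with open(path) as src:
        for line in src:
            stripped = line.strip()
            indent = len(line) - len(line.lstrip())
            if not lines:
                if stripped != "{":
                    continue  # "[", "]", or blank lines between records
                record_indent = indent
            lines.append(line)
            if indent == record_indent and stripped in ("}", "},"):
                # End of a record: drop the separating comma and write it out.
                text = "".join(lines).rstrip().rstrip(",")
                out_path = os.path.join(out_dir, "record_%06d.json" % len(paths))
                with open(out_path, "w") as dst:
                    dst.write(text)
                paths.append(out_path)
                lines = []
    return paths


def load_each(paths):
    """Load the split files one by one, printing progress to spot a bad record."""
    records = []
    for p in paths:
        with open(p) as f:
            records.append(json.load(f))
        print("loaded", p)
    return records
```

Note that load_each still collects every record in memory, so it is mainly useful for finding out which record fails; for the Excel export it may be preferable to process each file as it is loaded instead of keeping the whole list.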

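If the file has no newlines or is not consistently indented, you would need to read it character by character while keeping two counters, incrementing them when you find [ or { and decrementing them when you find ] or } respectively. Also keep in mind that you may need to ignore any curly or square brackets that appear inside a string, though that may not be necessary.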
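A rough sketch of the character-by-character approach, for a file whose layout cannot be trusted. As a simplification of mine it keeps one combined nesting depth instead of two separate counters, and it skips brackets inside string literals. The name iter_objects and the chunk_size parameter are hypothetical, and the code assumes the top level is an array whose elements are all objects.

```python
import io
import json


def iter_objects(stream, chunk_size=65536):
    """Yield the objects of a top-level JSON array one at a time."""
    depth = 0          # nesting depth of {...} and [...], counting the outer array
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    buf = []           # characters of the object currently being collected
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            # Collect everything from a record's opening "{" to its closing "}".
            if depth >= 2 or (depth == 1 and ch == "{" and not in_string):
                buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "{[":
                depth += 1
            elif ch in "}]":
                depth -= 1
                if depth == 1 and buf:
                    # Back at array level: one complete record was collected.
                    yield json.loads("".join(buf))
                    buf = []


# Small in-memory sample; a real file would be passed as open(filename).
sample = '[{"_id": "my_id_1.1.1", "content": {"note": "a } in a string"}}, {"_id": "my_id_1.1.2", "content": {}}]'
ids = [record["_id"] for record in iter_objects(io.StringIO(sample))]
```

Because the function is a generator, only one record is held in memory at a time unless the caller stores them all.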

Answer 2

This is how I did it; I hope it helps. You may need to skip the first line, "[", and, where a line ends in "},", remove the trailing ",".

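import ujson  # third-party; the standard-library json module can be used the same way

buf = ""
with open(filename) as f:
    for line in f:
        if not buf and line.strip() in ("[", "]", ""):
            continue  # skip the array brackets around the records
        buf += line
        try:
            # Drop the "," that separates one record from the next
            jfile = ujson.loads(buf.rstrip().rstrip(","))
        except ValueError:
            continue  # Not yet a complete JSON value, keep accumulating
        buf = ""
        # do something with jfile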

Python: json.load on a large JSON file raises MemoryError
