Question

I need to analyze a number of JSON files. I am using iPython (Python 3.5.2 | IPython 5.0.0), reading each file into a dictionary and appending each dictionary to a list.

My main bottleneck is reading the files. Some files are small and are read quickly, but the larger ones are slowing me down.

Here is some example code (sorry, I can't provide the actual data files):

import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
        p.close()
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)

event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)

The support tickets are very small - the largest one I've seen is 6KB. So this code runs pretty fast:

In [3]: len(support_files)
Out[3]: 5278

In [5]: %timeit giant_list(support_files)
1 loop, best of 3: 557 ms per loop

But the larger files are definitely slowing me down... these event files can reach ~2.5MB each:

In [7]: len(event_files) # there will be a lot more of these soon :-/
Out[7]: 397

In [8]: %timeit giant_list(event_files)
1 loop, best of 3: 14.2 s per loop

I've researched how to speed this up and came across this post; however, when using UltraJSON the timing was slightly worse:

In [3]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop

SimpleJSON didn't do any better:

In [4]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop

Any tips on how to optimize this code and read a lot of JSON files into Python more efficiently are much appreciated.

Finally, this post is the closest I've found to my question, but it deals with one giant JSON file rather than many smaller ones.

Answer 1

Use a list comprehension instead of repeatedly calling append in a loop.

def giant_list(json_files):
    return [read_json_file(path) for path in json_files]

You are closing the file object twice; just do it once (the file is closed automatically when the with block exits).

def read_json_file(path_to_file):
    with open(path_to_file) as p:
        return json.load(p)
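
If you want to convince yourself that the explicit close() is redundant, a minimal check along these lines should do. The helper name file_is_closed_after_with is made up for illustration; it just records the closed attribute of the file object inside and after the with block.

```python
import json


def file_is_closed_after_with(path_to_file):
    # Hypothetical helper (not part of the original code): reports whether
    # the file object is closed inside the with block and after it exits.
    with open(path_to_file) as p:
        json.load(p)
        closed_inside = p.closed   # still open while inside the block
    closed_after = p.closed        # the with statement has closed it by now
    return closed_inside, closed_after
```

Called on any valid JSON file, this returns a pair of booleans, so you can see the state of the file before and after the block without calling close() yourself.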

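At the end of the day, your problem is I/O bound, but these changes will help. Also, I have to ask - do you really need to have all of these dictionaries in memory at the same time?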
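
If the answer to that last question is no, one option is to stream the files with a generator rather than build a list. The following is only a sketch under that assumption: iter_json_files and count_top_level_keys are names invented here, and the key-counting step is a stand-in for whatever analysis you actually need. It will not make each individual read faster; it only avoids keeping every parsed file in memory at once.

```python
import glob
import json
from collections import Counter


def read_json_file(path_to_file):
    # The with block closes the file on exit; no explicit close() is needed.
    with open(path_to_file) as p:
        return json.load(p)


def iter_json_files(json_files):
    # Hypothetical helper: yields one parsed document at a time, so only the
    # current document has to stay in memory.
    for path in json_files:
        yield read_json_file(path)


def count_top_level_keys(json_files):
    # Placeholder analysis (an assumption for illustration): counts how often
    # each top-level key appears across all files that contain a JSON object.
    counts = Counter()
    for data in iter_json_files(json_files):
        if isinstance(data, dict):
            counts.update(data.keys())
    return counts


if __name__ == "__main__":
    event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
    print(count_top_level_keys(event_files))
```

Whether this helps depends on the analysis: if every step needs random access to all of the records, you still need the full list.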

Fastest way to read thousands of JSON files in Python
