Question

This is not about reading large JSON files, but about reading a large number of JSON files in the most efficient way.

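Problem

I am working with the last.fm dataset from the Million Song Dataset. The data is provided as a set of JSON-encoded text files, where the keys are: track_id, artist, title, timestamp, similars and tags.

After trying a few options, I currently read them into pandas in the following way, since this turned out to be the fastest, as shown here: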

import os
import pandas as pd
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json


# Path to the dataset
path = "../lastfm_train/"

# Collect the paths of all JSON files in the dataset
all_files = [
    os.path.join(root, file)
    for root, dirs, files in os.walk(path)
    for file in files
    if file.endswith('.json')
]

# Parse every file into a dict
data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)

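The current method reads a subset (1% of the full dataset) in under a second. However, reading the full training set is far too slow and takes forever (I have waited a couple of hours), and it has become the bottleneck for further tasks such as the one shown in the question here.

I am also using ujson to speed up parsing of the JSON files, as can be seen from this question here.

Update 1: Using a generator comprehension instead of a list comprehension.

data_list = (json.load(open(file)) for file in all_files)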

Answer 1

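If you need to perform multiple IO operations on the dataset, why not convert the .json files to a faster IO format? If the total size of the dataset is 2.5 GB, reading it should not take more than a minute, even on a standard MacBook, when it is stored as a .csv file.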
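As a rough sketch of what such a one-time conversion could look like (this is my assumption, not something shipped with the dataset): walk the directory once and write the two fields of interest into a single CSV. The helper name `convert_to_csv` and its arguments are hypothetical, and `similars` is stored as a JSON string because it is a nested list.

```python
import csv
import json
import os


def convert_to_csv(src_dir, csv_path):
    """Write track_id and similars from every .json file under src_dir into one CSV.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["track_id", "similars"])
        for root, dirs, files in os.walk(src_dir):
            for name in files:
                if not name.endswith(".json"):
                    continue
                with open(os.path.join(root, name), encoding="utf-8") as fh:
                    record = json.load(fh)
                # similars is a nested list, so keep it as a JSON string in one cell
                writer.writerow([record["track_id"], json.dumps(record["similars"])])
                count += 1
    return count
```

The slow pass over the many small files then happens only once; later runs could load the single file with something like pd.read_csv(csv_path, index_col='track_id') and decode the similars column with json.loads when needed.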

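For example, new in pandas 0.20 is the .feather format. See here for a description by the pandas author. In my own tests on a standard development MacBook, I read a 1 GB file in about 1 second.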

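One more note: I recommend using the top-level feather.read_dataframe function over pandas.read_feather, since the pandas version does not yet allow reading a subset of columns. You can download feather here or just use pip install feather-format.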

Answer 2

I would build an iterator over the files and yield only the two columns you need.

You can then instantiate the DataFrame from that iterator.

import os
import json
import pandas as pd

# Path to the dataset
path = "../lastfm_train/"

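def data_iterator(path):
    # Walk the tree once and lazily yield one small record per JSON file
    for root, dirs, files in os.walk(path):
        for f in files:
            if f.endswith('.json'):
                fp = os.path.join(root, f)
                with open(fp) as o:
                    data = json.load(o)
                # Keep only the two fields needed for the DataFrame
                yield {"similars": data["similars"], "track_id": data["track_id"]}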


df = pd.DataFrame(data_iterator(path))
df.set_index('track_id', inplace=True)

This way you only go through the file list once, and you do not duplicate the data before and after passing it to the DataFrame.
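Because the iterator is lazy, one way to sanity-check it before committing to the full run is to pull just the first few records. A minimal sketch, assuming the same directory layout as above; `iter_records`, `take_sample` and the `fields` parameter are hypothetical names I am adding for illustration:

```python
import itertools
import json
import os


def iter_records(path, fields=("similars", "track_id")):
    """Lazily yield the selected fields from each .json file under path."""
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".json"):
                with open(os.path.join(root, name)) as fh:
                    data = json.load(fh)
                yield {key: data[key] for key in fields}


def take_sample(path, n):
    """Read at most n files; islice stops pulling from the generator after n records."""
    return list(itertools.islice(iter_records(path), n))
```

Something like pd.DataFrame(take_sample(path, 1000)) would then give a quick timing estimate on a small slice without opening the remaining files.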

Reading a large number of JSON files in Python?
