Question

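I have about 6,000 JSON files totaling roughly 50 GB, which I am currently loading into a pandas DataFrame using the approach below. (The format_pandas function sets up my pandas DataFrame as it reads each JSON row.)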

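import glob
import json
import os
from pathlib import Path

import pandas as pd

path = '/Users/shabina.rayan/Desktop/Jupyter/Scandanavia Weather/Player  Data'
records = []  # presumably filled by format_pandas (not shown)
for filename in glob.glob(os.path.join(path, '*.JSON')):
    file = Path(filename)
    with open(file) as json_data:
        j = json.load(json_data)
        format_pandas(j)
# Serialize the collected records to one JSON string, then parse it into a DataFrame
pandas_json = json.dumps(records)
df = pd.read_json(pandas_json, orient="records")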

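As you can guess, processing my data this way takes an extremely long time. Does anyone have suggestions for other ways I could process 50 GB of JSON files and visualize or analyze them?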

Answer 1

Dump it into Elasticsearch and run queries as needed.
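As a rough sketch of what the loading step might look like, the code below streams the files one at a time and groups the documents into newline-delimited bodies in the format that the Elasticsearch `_bulk` endpoint accepts (an action line followed by the document itself). The helper names `iter_documents` and `bulk_batches`, the index name `player-data`, and the assumption that each file holds either a single JSON object or a list of objects are all illustrative. None of them come from the question.

```python
import glob
import json
import os
from typing import Iterable, Iterator


def iter_documents(path: str, pattern: str = "*.JSON") -> Iterator[dict]:
    """Yield documents one by one so only a single file is in memory at a time."""
    for filename in sorted(glob.glob(os.path.join(path, pattern))):
        with open(filename, encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: a file holds either one JSON object or a list of objects.
        if isinstance(data, list):
            yield from data
        else:
            yield data


def bulk_batches(docs: Iterable[dict], index: str, batch_size: int = 1000) -> Iterator[str]:
    """Group documents into NDJSON request bodies for the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        if len(lines) >= 2 * batch_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        # Flush the final, possibly smaller, batch.
        yield "\n".join(lines) + "\n"
```

Each yielded string could then be sent as the body of a POST request to the cluster's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. Batching keeps individual requests small, and the right `batch_size` depends on document size and cluster capacity.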
