Best way to merge JSON files into one

Time: 2019-12-19 00:18:36

Tags: python json

I have JSON files with the same names in different folders, laid out in the folder structure shown below:

folder1/
    file1.json
    file2.json
    file3.json
folder2/
    file1.json
    file2.json
    file3.json
    file4.json
folder3/
    file1.json
    file2.json
    file3.json
    file4.json
    file5.json
....

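What is the best way to combine the JSON files available across all the folders into a single JSON file? The keys in file1.json are unique across all the folders in which it exists.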

So far I can think of the following approaches, but they feel slow, since each JSON file is about 5 MB.

import json
from pathlib import Path

output_dir = Path(location_of_output_folder)
output_dir.mkdir(parents=True, exist_ok=True)

# find all the folders
root_dir = Path(root_location_for_folders)
folders = [fld for fld in root_dir.iterdir() if fld.is_dir()]

# find all the unique file names
all_filenames = []
for fld in folders:
    for f in fld.glob('*.json'):
        all_filenames.append(f.name)


## Approach 1
# Join file that possibly exists across all the folders by creating empty list
for f in list(set(all_filenames)):
    f_data = []

    for fld in folders:
        if (fld / f).is_file():
           with open(fld /f, 'r') as fp:
               f_data.append(json.load(fp))

    with open(output_dir / f, 'w') as fp:
        json.dump(f_data, fp, indent=4)


## Approach 2
# Join file that possibly exists across all the folders by creating empty dict
for f in list(set(all_filenames)):
    f_data = {}

    for fld in folders:
        if (fld / f).is_file():
           with open(fld /f, 'r') as fp:
               f_data.update(json.load(fp))

    with open(output_dir / f, 'w') as fp:
        json.dump(f_data, fp, indent=4)

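Is there a better (faster) way? I am only concerned about time and am not interested in a pythonic solution.

Thanks

Update #1: Files with the same file name should be merged. Sorry if I wasn't clear. Each file will only have a few keys (l1, l2, l3, l4) that are similar across all files.

Example

a. Structure of file1.json in folder1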

{
    k1: {
           l1: 11,
           l2: 12,
           l3: 13,
           l4: 14,
        },

    k2: {
           l1: 21,
           l2: 22,
           l3: 23,
           l4: 24,
        }
    .....
}

b. Structure of file2.json in folder2

{
    k8: {
           l1: 41,
           l2: 42,
           l3: 43,
           l4: 44,
        },

    k9: {
           l1: 51,
           l2: 52,
           l3: 53,
           l4: 54,
        }
    .....
}

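4 answers:

Answer 0: (score: 1)

You don't need to parse the input JSON files; you can simply read them as text files, which will be much faster (basically one system call per file). Then combine them into a global JSON list by adding [ at the beginning, , after each file's content, and ] at the end. OK, the lines won't be indented inside the level-0 list, but who cares? Here is a basic implementation:

infiles = [...]  # the whole list of input JSON files
outfile = 'out.json'
with open(outfile, 'w') as o:
    o.write('[')
    for infile in infiles[:-1]:  # loop over all files except the last one
        with open(infile, 'r') as i:
            o.write(i.read().strip() + ',\n')
    with open(infiles[-1]) as i:  # special treatment for the last file
        o.write(i.read().strip() + ']\n')

Note that this implementation holds only one input file in RAM at a time, so, contrary to the other approaches, it easily handles a very long list of files.

One last point: if you really want all the inner lines indented, you can simply read each file line by line (using the readline() method on the file) and prefix each line with 4 spaces before writing it to the output file. But you will lose performance...

Edit: a slightly modified version, with more code factorization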
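A factored version of the approach above could look something like the following sketch. It is not the answer's original code: the function name `write_json_list` and its `indent` parameter are made up for illustration, and it assumes every input file already contains valid UTF-8 JSON. The optional prefix covers the indentation idea from the last point. Prefixing lines is safe because a JSON string cannot contain a raw line break.

```python
from pathlib import Path


def write_json_list(infiles, outfile, indent=""):
    """Concatenate JSON files as raw text into one JSON list, without parsing.

    Hypothetical helper: assumes each input file holds one valid JSON value.
    """
    with open(outfile, "w", encoding="utf-8") as out:
        out.write("[\n")
        for n, infile in enumerate(infiles):
            if n:  # the separator goes between entries, never after the last
                out.write(",\n")
            # Only one input file is held in memory at a time
            text = Path(infile).read_text(encoding="utf-8").strip()
            # Optionally prefix every line so entries look nested in the list
            out.write("\n".join(indent + line for line in text.splitlines()))
        out.write("\n]\n")
```

Calling `write_json_list(infiles, 'out.json')` writes the files unindented, while passing `indent='    '` shifts every inner line by 4 spaces at the cost of the extra line splitting.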

Answer 1: (score: 0)

This is the simplest code I can think of:

from glob import glob
from os import makedirs, path
from pathlib import Path
import json

# Directories
input_dir = "in"
output_file = "out/out.json"

# Get array of files
files = glob(path.join(input_dir, "**", "*.json"))

# Data object
data = {}

# Merge all files
for file in files:
    with open(file) as fp:  # close each input file once it has been read
        data.update(json.load(fp))

# Create output directory
makedirs(path.dirname(output_file), exist_ok=True)

# Dump data
with open(output_file, "w") as fp:  # make sure the output is flushed and closed
    json.dump(data, fp)

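Answer 2: (score: 0)

Edit: I am aware that this solution no longer meets the requirements; I will update it shortly.

Setting aside for now the question of whether this really matters, here is what I came up with.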

import glob
import json

file_names = glob.glob('../resources/json_files/*.json')

json_list = []

for curr_f_name in file_names:
    with open(curr_f_name) as curr_f_obj:
        json_list.append(json.load(curr_f_obj))

with open('../out/json_merge_out.json', 'w') as out_file:
    json.dump(json_list, out_file, indent=4)

JSON files in the input directory:

example_1.json

{
    "fruit": "Apple",
    "size": "Large",
    "color": "Red"
}

example_2.json

{
    "quiz": {
        "sport": {
            "q1": {
                "question": "Which one is correct team name in NBA?",
                "options": [
                    "New York Bulls",
                    "Los Angeles Kings",
                    "Golden State Warriros",
                    "Huston Rocket"
                ],
                "answer": "Huston Rocket"
            }
        },
        "maths": {
            "q1": {
                "question": "5 + 7 = ?",
                "options": [
                    "10",
                    "11",
                    "12",
                    "13"
                ],
                "answer": "12"
            },
            "q2": {
                "question": "12 - 8 = ?",
                "options": [
                    "1",
                    "2",
                    "3",
                    "4"
                ],
                "answer": "4"
            }
        }
    }
}

Contents of the output file json_merge_out.json:

[
    {
        "quiz": {
            "sport": {
                "q1": {
                    "question": "Which one is correct team name in NBA?",
                    "options": [
                        "New York Bulls",
                        "Los Angeles Kings",
                        "Golden State Warriros",
                        "Huston Rocket"
                    ],
                    "answer": "Huston Rocket"
                }
            },
            "maths": {
                "q1": {
                    "question": "5 + 7 = ?",
                    "options": [
                        "10",
                        "11",
                        "12",
                        "13"
                    ],
                    "answer": "12"
                },
                "q2": {
                    "question": "12 - 8 = ?",
                    "options": [
                        "1",
                        "2",
                        "3",
                        "4"
                    ],
                    "answer": "4"
                }
            }
        }
    },
    {
        "fruit": "Apple",
        "size": "Large",
        "color": "Red"
    }
]

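Answer 3: (score: -1)

If you are really concerned about time, you could go straight to C++ or C. As @Barmar said in the comments, I think you could optimize the setup, since you need to open all the files anyway.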
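One possible reading of "optimize the setup", staying in Python, is to scan the folders once and group the paths by file name, rather than collecting the names first and then probing every folder with is_file(). The sketch below is only an illustration of that idea. The name `merge_by_name` and its parameters are hypothetical. It assumes the folders sit exactly one level below the root, that the output folder lies outside the root, and that each file holds a JSON object whose keys are unique across folders.

```python
import json
from collections import defaultdict
from pathlib import Path


def merge_by_name(root_dir, output_dir):
    """Merge same-named JSON files found one level below root_dir.

    Hypothetical helper: assumes every file holds a JSON object (dict).
    """
    root_dir, output_dir = Path(root_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # One directory scan: map each file name to every path that carries it
    groups = defaultdict(list)
    for path in root_dir.glob("*/*.json"):
        groups[path.name].append(path)

    for name, paths in groups.items():
        merged = {}
        for path in paths:
            with open(path, encoding="utf-8") as fp:
                merged.update(json.load(fp))
        with open(output_dir / name, "w", encoding="utf-8") as fp:
            # No indent argument, so the output stays compact
            json.dump(merged, fp)
```

Whether this is measurably faster than the approaches in the question depends on the number of folders and the disk; for 5 MB files the parsing itself is likely to dominate, so it is worth timing before adopting it.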
