Question

I have JSON files with the same names in different folders, in the folder structure shown below:

folder1/
    file1.json
    file2.json
    file3.json
folder2/
    file1.json
    file2.json
    file3.json
    file4.json
folder3/
    file1.json
    file2.json
    file3.json
    file4.json
    file5.json
....

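What is the best way to combine the JSON files available across all the folders into a single JSON file? The keys in file1.json are unique across all the folders in which it is present.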

So far I have come up with the following approaches, but they feel slow because each JSON file is around 5 MB.

import json
from pathlib import Path

output_dir = Path(location_of_output_folder)
output_dir.mkdir(parents=True, exist_ok=True)

# find all the folders
root_dir = Path(root_location_for_folders)
folders = [fld for fld in root_dir.iterdir() if fld.is_dir()]

# find all the unique file names
all_filenames = []
for fld in folders:
    for f in fld.glob('*.json'):
        all_filenames.append(f.name)


## Approach 1
# Join file that possibly exists across all the folders by creating empty list
for f in list(set(all_filenames)):
    f_data = []

    for fld in folders:
        if (fld / f).is_file():
            with open(fld / f, 'r') as fp:
                f_data.append(json.load(fp))

    with open(output_dir / f, 'w') as fp:
        json.dump(f_data, fp, indent=4)


## Approach 2
# Join file that possibly exists across all the folders by creating empty dict
for f in list(set(all_filenames)):
    f_data = {}

    for fld in folders:
        if (fld / f).is_file():
            with open(fld / f, 'r') as fp:
                f_data.update(json.load(fp))

    with open(output_dir / f, 'w') as fp:
        json.dump(f_data, fp, indent=4)

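Is there a better (faster) way? I only care about time and am not interested in a Pythonic solution.

Thanks

Update #1: Files with the same file name should be merged. Sorry if I wasn't clear. Each file will have only a few keys that are similar across all files (l1, l2, l3, l4).

Example

a. Structure of file1.json in folder1

a. Structure of file2.json in folder2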

Answer 1

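You don't need to parse the input JSON files; just read them as text files, which is much faster (basically one system call per file). Then combine them into one global JSON list by adding [ at the beginning, a , after each file's contents, and ] at the end. OK, the lines won't be indented inside the level-0 list, but who cares? Here is a basic implementation:

infiles = [...] # the whole list of input JSON files
outfile = 'out.json'
with open(outfile,'w') as o:
    o.write('[')
    for infile in infiles[:-1]: # loop over all files except the last one
        with open(infile,'r') as i:
            o.write(i.read().strip() + ',\n')
    with open(infiles[-1]) as i: # special treatment for last file
        o.write(i.read().strip() + ']\n')

Note that this implementation holds the input files in RAM one at a time, so unlike the other approaches it can easily handle a very long list of files.

One last point: if you really want all inner lines indented, you can simply read each file line by line (using the readline() method on the file) and add a prefix of 4 spaces before writing to the output file. But you will lose performance...

Edit: a slightly modified version with more code factoring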
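A factored variant of the text-concatenation idea might look roughly like the following. This is a sketch, not the answerer's original code: the function name `concat_json_as_list` and the demo file names are made up for illustration, and it assumes every input file holds exactly one valid JSON value.

```python
import json
import tempfile
from pathlib import Path


def concat_json_as_list(infiles, outfile):
    """Write the raw text of each input file into one JSON list, without parsing."""
    with open(outfile, "w") as out:
        out.write("[")
        for index, infile in enumerate(infiles):
            if index:  # the separator goes before every file except the first
                out.write(",\n")
            with open(infile, "r") as src:
                out.write(src.read().strip())
        out.write("]\n")


# Hypothetical demo data, written to a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.json").write_text('{"l1": 1}\n')
(demo_dir / "b.json").write_text('{"l2": [2, 3]}\n')

merged_path = demo_dir / "out.json"
concat_json_as_list([demo_dir / "a.json", demo_dir / "b.json"], merged_path)

# Parsing here is only a sanity check; the merge itself never calls json.load.
with open(merged_path) as fp:
    merged = json.load(fp)
print(merged)
```

Writing the separator before each file except the first removes the special case for the last file, and an empty input list still produces a valid `[]`.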

Answer 2

This is the simplest code I can think of:

from glob import glob
from os import makedirs, path
import json

# Directories
input_dir = "in"
output_file = "out/out.json"

# Get array of files ("**" without recursive=True matches exactly one folder level)
files = glob(path.join(input_dir, "**", "*.json"))

# Data object
data = {}

# Merge all files
for file in files:
    with open(file) as fp:
        data.update(json.load(fp))

# Create output directory
makedirs(path.dirname(output_file), exist_ok=True)

# Dump data
with open(output_file, "w") as fp:
    json.dump(data, fp)
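Since the question's update asks for files sharing a name to be merged with each other, a variant of this approach could group the paths by file name first and write one output per name. The sketch below assumes each file holds a JSON object at the top level; the function name `merge_by_filename` and the demo folder layout are illustrative only.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def merge_by_filename(root_dir, output_dir):
    """Merge same-named JSON files from each subfolder into one dict per name."""
    root_dir, output_dir = Path(root_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Group paths by file name in a single pass over the subfolders.
    groups = defaultdict(list)
    for json_path in sorted(root_dir.glob("*/*.json")):
        groups[json_path.name].append(json_path)

    for name, paths in groups.items():
        merged = {}
        for json_path in paths:
            with open(json_path) as fp:
                merged.update(json.load(fp))
        with open(output_dir / name, "w") as fp:
            fp.write(json.dumps(merged))  # compact output, no indent
    return sorted(groups)


# Hypothetical demo layout in a temporary directory.
base = Path(tempfile.mkdtemp())
(base / "in" / "folder1").mkdir(parents=True)
(base / "in" / "folder2").mkdir(parents=True)
(base / "in" / "folder1" / "file1.json").write_text('{"l1": 1}')
(base / "in" / "folder2" / "file1.json").write_text('{"l2": 2}')
(base / "in" / "folder2" / "file4.json").write_text('{"l3": 3}')

names = merge_by_filename(base / "in", base / "out")
with open(base / "out" / "file1.json") as fp:
    file1 = json.load(fp)
print(names, file1)
```

Grouping up front avoids probing every folder for every file name. In CPython, writing `json.dumps(...)` without `indent` is often faster than `json.dump(..., indent=4)`, though that is worth measuring on your own data.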

Answer 3

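Edit: I realize this solution no longer meets the requirements; I will update it shortly.

Leaving aside for the moment the question of whether this matters, here is what I came up with.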

import glob
import json

file_names = glob.glob('../resources/json_files/*.json')

json_list = []

for curr_f_name in file_names:
    with open(curr_f_name) as curr_f_obj:
        json_list.append(json.load(curr_f_obj))

with open('../out/json_merge_out.json', 'w') as out_file:
    json.dump(json_list, out_file, indent=4)

The directory contains these JSON files:

example_1.json:

{
    "fruit": "Apple",
    "size": "Large",
    "color": "Red"
}

example_2.json:

{
    "quiz": {
        "sport": {
            "q1": {
                "question": "Which one is correct team name in NBA?",
                "options": [
                    "New York Bulls",
                    "Los Angeles Kings",
                    "Golden State Warriros",
                    "Huston Rocket"
                ],
                "answer": "Huston Rocket"
            }
        },
        "maths": {
            "q1": {
                "question": "5 + 7 = ?",
                "options": [
                    "10",
                    "11",
                    "12",
                    "13"
                ],
                "answer": "12"
            },
            "q2": {
                "question": "12 - 8 = ?",
                "options": [
                    "1",
                    "2",
                    "3",
                    "4"
                ],
                "answer": "4"
            }
        }
    }
}

Contents of the output file json_merge_out.json:

[
    {
        "quiz": {
            "sport": {
                "q1": {
                    "question": "Which one is correct team name in NBA?",
                    "options": [
                        "New York Bulls",
                        "Los Angeles Kings",
                        "Golden State Warriros",
                        "Huston Rocket"
                    ],
                    "answer": "Huston Rocket"
                }
            },
            "maths": {
                "q1": {
                    "question": "5 + 7 = ?",
                    "options": [
                        "10",
                        "11",
                        "12",
                        "13"
                    ],
                    "answer": "12"
                },
                "q2": {
                    "question": "12 - 8 = ?",
                    "options": [
                        "1",
                        "2",
                        "3",
                        "4"
                    ],
                    "answer": "4"
                }
            }
        }
    },
    {
        "fruit": "Apple",
        "size": "Large",
        "color": "Red"
    }
]

Answer 4

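If you really care about time, you could go straight to C++ or C. As @Barmar said in the comments, I think you could optimize the set construction, since you need to open all the files anyway.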

Best way to merge JSON files into one
