Question

I have 10 JSONL files, each containing 50,000 entries, and they all follow the same structure:

{"tracking_code":"28847594","from_country":"FR","to_country":"FR","package_type_id":10,"transaction_id":168491899,"shipping_label_created":"2018-09-18 11:57:12"}
{"tracking_code":"28847594","from_country":"FR","to_country":"FR","package_type_id":10,"transaction_id":168491899,"shipping_label_created":"2018-09-18 11:57:12"}
{"tracking_code":"28847594","from_country":"FR","to_country":"FR","package_type_id":10,"transaction_id":168491899,"shipping_label_created":"2018-09-18 11:57:12"}
...

I want to merge the 10 files into a single file using Python.

The desired output is a single JSONL file containing all the entries from the 10 files. Simply appending one file to the next is enough.

Answer 1

All you need to do is concatenate the files; there is no need to decode, or even validate, the JSON objects in each file.

from contextlib import ExitStack
from itertools import chain

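filenames = ["file1.json", "file2.json", "file3.json", ...]
# Open the output file for writing; ExitStack closes every input file on exit.
with ExitStack() as stack, open("file.json", "w") as out:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    for line in chain.from_iterable(files):
        # Each line already ends with a newline, so write it unchanged.
        out.write(line)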

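ExitStack collects the open file handles so that their close methods are called when the with block exits. chain.from_iterable lets you iterate over the opened files one after another.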
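
One caveat with line-by-line concatenation: if an input file does not end with a newline, its last record would run into the first record of the next file. A minimal sketch that guards against this is shown here; the helper name merge_jsonl and the UTF-8 encoding are assumptions, not part of the original answer.

```python
def merge_jsonl(filenames, out_path):
    """Concatenate JSONL files so that every record ends with a newline.

    merge_jsonl is a hypothetical helper name; UTF-8 input is an assumption.
    Returns the number of lines written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for fname in filenames:
            with open(fname, encoding="utf-8") as src:
                for line in src:
                    # Only the last line of a file can lack a trailing newline.
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
                    count += 1
    return count
```

Opening the inputs one at a time also avoids holding all the file handles open at once, which may matter when there are many more than 10 files.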

This is essentially a Python reimplementation of the cat command, which you can also call directly:

import subprocess

# cat file1.json file2.json file3.json ... > file.json
filenames = ["file1.json", "file2.json", "file3.json", ...]
with open("file.json", "w") as out:
    subprocess.run(['cat'] + filenames, stdout=out)
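
Because cat is a Unix utility and may not be available as an executable everywhere (on Windows, for example), a portable alternative is to copy the raw bytes with shutil.copyfileobj from the standard library. This is only a sketch: the function name concat_files is hypothetical, and it assumes every input file already ends with a newline.

```python
import shutil


def concat_files(filenames, out_path):
    """Byte-for-byte concatenation, similar in spirit to `cat a b c > out`.

    concat_files is a hypothetical helper name, not from the original answer.
    """
    with open(out_path, "wb") as out:
        for fname in filenames:
            with open(fname, "rb") as src:
                # Copies in chunks, so a large file is never loaded whole into memory.
                shutil.copyfileobj(src, out)
```

Working in binary mode sidesteps any encoding or newline translation, so the output is exactly the inputs laid end to end.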
