I have 10 JSONL files, each with 50,000 entries, and they all follow the same structure:
{"tracking_code":"28847594","from_country":"FR","to_country":"FR","package_type_id":10,"transaction_id":168491899,"shipping_label_created":"2018-09-18 11:57:12"}
{"tracking_code":"28847594","from_country":"FR","to_country":"FR","package_type_id":10,"transaction_id":168491899,"shipping_label_created":"2018-09-18 11:57:12"}
{"tracking_code":"28847594","from_country":"FR","to_country":"FR","package_type_id":10,"transaction_id":168491899,"shipping_label_created":"2018-09-18 11:57:12"}
...
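I want to merge the 10 files into a single file using Python.
The desired output is one JSONL file containing all the entries from the 10 files. Simply appending one file to the next is enough.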
Answer 0 (score: 2)
All you need to do is concatenate the files; there is no need to decode, or even look at, the JSON objects in each file.
from contextlib import ExitStack
from itertools import chain

filenames = ["file1.json", "file2.json", "file3.json", ...]

# Open the output for writing; ExitStack closes every input file on exit.
with ExitStack() as stack, open("file.json", "w") as out:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    # Each line already ends with a newline, so write it unchanged.
    for line in chain.from_iterable(files):
        out.write(line)
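ExitStack collects the open file handles and calls their close methods when the with block exits. chain.from_iterable lets you iterate over the open files one after another.
This is essentially a Python reimplementation of the cat command, which you can also invoke directly: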
import subprocess

# cat file1.json file2.json file3.json ... > file.json
filenames = ["file1.json", "file2.json", "file3.json", ...]
with open("file.json", "w") as out:
    subprocess.run(['cat'] + filenames, stdout=out)
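Plain concatenation, whether in Python or with cat, assumes every input file ends with a newline. If one does not, its last record is joined to the first record of the next file on a single line. The following is a minimal sketch that guards against this; the function name merge_jsonl and its parameters are hypothetical and chosen only for illustration.

```python
def merge_jsonl(filenames, out_path):
    """Concatenate JSONL files into out_path (hypothetical helper).

    Files are copied in binary mode without decoding the JSON. A newline
    is appended after any input whose last line lacks one, so records
    from adjacent files never end up on the same line.
    """
    with open(out_path, "wb") as out:
        for fname in filenames:
            with open(fname, "rb") as src:
                last = b"\n"  # an empty file needs no extra newline
                for line in src:
                    out.write(line)
                    last = line
                if not last.endswith(b"\n"):
                    out.write(b"\n")
```

Calling merge_jsonl(filenames, "file.json") with the real list of input names would then produce the merged file. The files are processed one at a time, so only a single input is open at any moment.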