I would like to implement a streaming JSON parser for a very large JSON file (~1TB) that cannot be loaded into memory. One option is to use something like https://github.com/stedolan/jq to convert the file into newline-delimited JSON, but there are various other things I need to do to each JSON object that make this approach less than ideal.
Given a very large JSON object, how would I parse it object by object, similar to this approach for XML: https://www.ibm.com/developerworks/library/x-hiperfparse/index.html.
For example, in pseudocode:
with open('file.json','r') as f:
    json_str = ''
    for line in f:  # what if there are no newlines in the json obj?
        json_str += line
        if is_valid(json_str):
            obj = json.loads(json_str)
            do_something()
            json_str = ''
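For comparison, an incremental parser such as the third-party ijson package can give this kind of object-by-object iteration; a minimal sketch, assuming the top-level value of file.json is one large array:

import ijson  # third-party incremental JSON parser

with open('file.json', 'rb') as f:
    # ijson's 'item' prefix yields each element of a top-level array,
    # one object at a time, without reading the whole file into memory
    for obj in ijson.items(f, 'item'):
        do_something(obj)  # placeholder, as in the pseudocode above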
Also, I have found that jq -c is not particularly fast (ignoring memory considerations). For example, doing json.loads is about as fast as (and slightly faster than) using jq -c. I also tried ujson, but kept getting corruption errors that I believe are related to the file size.
# file size is 2.2GB
>>> import json,time
>>> t0=time.time();_=json.loads(open('20190201_itunes.txt').read());print (time.time()-t0)
65.6147990227
$ time cat 20190206_itunes.txt|jq -c '.[]' > new.json
real 1m35.538s
user 1m25.109s
sys 0m15.205s
Finally, here is a sample 100KB JSON input that can be used for testing: https://hastebin.com/ecahufonet.json
Answer 0 (score: 1)
If the file contains one big JSON object (an array or a map), then per the JSON spec the whole object must be read before any of its components can be accessed.
If, for example, the file is an array containing objects, [ {...}, {...} ], then newline-delimited JSON is far more efficient, since you only have to keep one object in memory at a time and the parser only has to read one line before it can start processing.
If you need to keep track of some of the objects for later use during parsing, I suggest creating a dict to hold those running values and carrying them along as you iterate through the file.
Say you have the JSON
{"timestamp": 1549480267882, "sensor_val": 1.6103881016325283}
{"timestamp": 1549480267883, "sensor_val": 9.281329310309406}
{"timestamp": 1549480267883, "sensor_val": 9.357327083443344}
{"timestamp": 1549480267883, "sensor_val": 6.297722749124474}
{"timestamp": 1549480267883, "sensor_val": 3.566667175421604}
{"timestamp": 1549480267883, "sensor_val": 3.4251473635178655}
{"timestamp": 1549480267884, "sensor_val": 7.487766674770563}
{"timestamp": 1549480267884, "sensor_val": 8.701853236245032}
{"timestamp": 1549480267884, "sensor_val": 1.4070662393018396}
{"timestamp": 1549480267884, "sensor_val": 3.6524325449499995}
{"timestamp": 1549480455646, "sensor_val": 6.244199614422415}
{"timestamp": 1549480455646, "sensor_val": 5.126780276231609}
{"timestamp": 1549480455646, "sensor_val": 9.413894020722314}
{"timestamp": 1549480455646, "sensor_val": 7.091154829208067}
{"timestamp": 1549480455647, "sensor_val": 8.806417239029447}
{"timestamp": 1549480455647, "sensor_val": 0.9789474417767674}
{"timestamp": 1549480455647, "sensor_val": 1.6466189633300243}
You can process it with

import json
from collections import deque
# RingBuffer from https://www.daniweb.com/programming/software-development/threads/42429/limit-size-of-a-list
class RingBuffer(deque):
    def __init__(self, size):
        deque.__init__(self)
        self.size = size

    def full_append(self, item):
        deque.append(self, item)
        # full, pop the oldest item, left most item
        self.popleft()

    def append(self, item):
        deque.append(self, item)
        # max size reached, append becomes full_append
        if len(self) == self.size:
            self.append = self.full_append

    def get(self):
        """returns a list of size items (newest items)"""
        return list(self)

def proc_data():
    # Declare some state management in memory to keep track of whatever you want
    # as you iterate through the objects
    metrics = {
        'latest_timestamp': 0,
        'last_3_samples': RingBuffer(3)
    }

    with open('test.json', 'r') as infile:
        for line in infile:
            # Load each line
            line = json.loads(line)
            # Do stuff with your running metrics
            metrics['last_3_samples'].append(line['sensor_val'])
            if line['timestamp'] > metrics['latest_timestamp']:
                metrics['latest_timestamp'] = line['timestamp']

    return metrics
print(proc_data())
Answer 1 (score: 0)
Consider converting this JSON into a filesystem tree (folders and files), so that every JSON object is converted into a folder containing files:

properties_000000002.txt
....

Each properties_X.txt file contains at most N (a limited number of) lines of the form

property_name: property_value

folder_0000001, folder_000002 - the names of the local folders

Every array is likewise converted into a folder containing files:

elements_0000000002.txt
....
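A minimal sketch of what that mapping could look like in Python (the dump_tree function, the folder/file naming scheme, and the line limit N are all illustrative assumptions, and this version parses the whole document up front, so a true 1TB input would still need a streaming parser in front of it):

import json
import os

MAX_LINES = 1000  # N, the per-file line limit (assumed value)

def dump_tree(value, folder):
    """Write a parsed JSON value out as the folder/file tree described above."""
    os.makedirs(folder, exist_ok=True)
    if isinstance(value, dict):
        scalars = []
        for i, (key, val) in enumerate(value.items(), 1):
            if isinstance(val, (dict, list)):
                # nested object or array -> nested folder
                dump_tree(val, os.path.join(folder, 'folder_%07d' % i))
            else:
                scalars.append('%s: %s' % (key, val))
        # split scalar properties across properties_X.txt files of at most MAX_LINES lines each
        for n, start in enumerate(range(0, len(scalars), MAX_LINES), 1):
            with open(os.path.join(folder, 'properties_%09d.txt' % n), 'w') as f:
                f.write('\n'.join(scalars[start:start + MAX_LINES]))
    elif isinstance(value, list):
        for i, element in enumerate(value, 1):
            if isinstance(element, (dict, list)):
                dump_tree(element, os.path.join(folder, 'folder_%07d' % i))
            else:
                with open(os.path.join(folder, 'elements_%010d.txt' % i), 'w') as f:
                    f.write(str(element))

# Example usage on a small file:
# dump_tree(json.load(open('sample.json')), 'sample_tree')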