I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). The content of each file is of the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores from each file.
When processing data_small, I did the following and was able to view its contents in 0.1 secs:
import json

with open('data_small') as f:
    content = json.load(f)
    print content  # I'll be applying the logic to find the unique values later.
But when processing data_large, I did the following and my system hung and slowed to a crawl; I had to force-close the process to get it back to normal speed. It took 2 mins just to print its contents:
with open('data_large') as f:
    content = json.load(f)
    print content  # I'll be applying the logic to find the unique values later.
How can I make my program more efficient when processing large datasets like this?
Answer 0 (score: 3)
Since your JSON file is not that big, you can load it into RAM in one go and collect all the unique values like this:
import json

with open('data_large') as f:
    content = json.load(f)
# Do not print content: dumping it all to stdout is what makes this slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the one-liner below, which first
# builds another list containing every value and only then filters it
# down to the unique ones:
# values = set([i['score'] for i in content])

# It's faster to save the results to a file than to print them.
with open('results.json', 'wb') as fid:
    # json can't serialize sets, hence the conversion to list.
    json.dump(list(values), fid)
If you need to handle even bigger files, look for a library that can parse the JSON file iteratively.
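For example, ijson can stream the elements of a top-level JSON array without ever holding the whole file in memory. A minimal sketch of that approach (not from the original answer, and assuming the file really is the flat [{"score": ...}, ...] array described in the question):

import ijson  # third-party: pip install ijson

unique_scores = set()
with open('data_large', 'rb') as f:
    # 'item' is ijson's prefix for each element of the top-level array,
    # so this yields one {"score": ...} dict at a time.
    for obj in ijson.items(f, 'item'):
        unique_scores.add(obj['score'])

print(sorted(unique_scores))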
Answer 1 (score: 0)
If you want to iterate over the JSON file in smaller chunks to conserve RAM, I'd suggest the approach below, since per your comment you don't want to use ijson for this. It only works because your example input is so simple: an array of dictionaries, each with a single key and value. For more complicated data this would get messy, and at that point I'd use an actual JSON streaming library.
import json

bytes_to_read = 10000
unique_scores = set()

with open('tmp.txt') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Find the indices of the dictionaries in this chunk.
        if '{' not in chunk:
            break
        opening = chunk.index('{')
        ending = chunk.rindex('}')
        # Load the complete dicts in the chunk and collect their scores.
        score_dicts = json.loads('[' + chunk[opening:ending + 1] + ']')
        for s in score_dicts:
            unique_scores.add(s.values()[0])
        # Rewind to just after the last complete dict, then read the next chunk.
        f.seek(-(len(chunk) - ending) + 1, 1)
        chunk = f.read(bytes_to_read)

print unique_scores
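A quick way to sanity-check the chunked reader (my own sketch, not part of the original answer; sample.json and its contents are made-up test data) is to write a tiny file in the question's format and compare against a plain json.load:

import json

# Made-up test data in the same shape as the question's files.
with open('sample.json', 'w') as f:
    json.dump([{'score': 68}, {'score': 78}, {'score': 68}], f)

# Ground truth from loading the whole file at once.
with open('sample.json') as f:
    expected = set(d['score'] for d in json.load(f))

# The chunked reader above, pointed at 'sample.json' instead of 'tmp.txt',
# should produce this same set.
print(expected)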