I'm trying to load a large log file (13+ GB) of JSON objects. I can dump the whole file into RAM, but the process is very slow, and I don't need all of the objects at once.
What's the best way to loop through the file and select only the JSON objects that fall within a specific date range? My code currently looks like this:
import json

path = "//logfile"  # path truncated in the original
records = []

with open(path, encoding="utf8") as file:
    for line in file:
        try:
            line_dict = json.loads(line)
            clean_dict = {"username": line_dict["username"], "language": line_dict["language"],
                          "key": line_dict["key"], "page_url": line_dict["page_url"],
                          "session": line_dict["session"], "timestamp": line_dict["timestamp"],
                          "value": line_dict["value"]}
            records.append(clean_dict)
        except (KeyError, ValueError, NameError, UnicodeDecodeError):
            continue
len(records)
Output: 25695470
I'm working in a Jupyter notebook.
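One approach worth sketching: since the file is newline-delimited JSON, you can stream it with a generator and keep only records whose timestamp falls in the range you care about, so nothing outside the range ever lands in RAM. This is a minimal sketch, not a drop-in answer; it assumes the `timestamp` field is an ISO 8601 string (adjust the parsing to your actual format), and `iter_records` is a hypothetical helper name.

```python
import json
from datetime import datetime

def iter_records(path, start, end):
    """Yield cleaned records whose timestamp falls in [start, end).

    Assumes each line is one JSON object and that "timestamp" is an
    ISO 8601 string, e.g. "2021-01-15T00:00:00" (an assumption --
    adapt datetime.fromisoformat to your real format).
    """
    keys = ("username", "language", "key", "page_url",
            "session", "timestamp", "value")
    with open(path, encoding="utf8") as f:
        for line in f:
            try:
                d = json.loads(line)
                ts = datetime.fromisoformat(d["timestamp"])
                rec = {k: d[k] for k in keys}
            except (KeyError, ValueError, UnicodeDecodeError):
                continue  # skip malformed or incomplete lines
            if start <= ts < end:
                yield rec

# Usage: only materialize what the range actually matches.
# records = list(iter_records("//logfile",
#                             datetime(2021, 1, 1),
#                             datetime(2021, 2, 1)))
```

Because the generator yields one record at a time, memory stays bounded by the size of the matching subset, and you can consume it lazily (e.g. feed it straight into `pandas.DataFrame(iter_records(...))`) instead of building the full 25-million-element list first.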