I'm trying to load a large log file (13+ GB) of JSON objects. I can dump the whole file into RAM, but the process is very slow, and I don't need all of the objects at once.
What's the best way to loop through the file and select only the JSON objects that fall within a specific date range? My code currently looks like this:
import json

path = "//logfile"  # path truncated in the original
records = []

with open(path, encoding="utf8") as file:
    for line in file:
        try:
            line_dict = json.loads(line)
            clean_dict = {"username": line_dict["username"], "language": line_dict["language"],
                          "key": line_dict["key"], "page_url": line_dict["page_url"],
                          "session": line_dict["session"], "timestamp": line_dict["timestamp"],
                          "value": line_dict["value"]}
            records.append(clean_dict)
        except (KeyError, ValueError, NameError, UnicodeDecodeError):
            continue
len(records)
Output: 25695470
I'm working in a Jupyter notebook.
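One approach worth sketching: since the file is newline-delimited JSON, you can stream it with a generator and keep only records whose timestamp falls in the range you care about, so nothing outside the range ever lands in RAM. This is a minimal sketch, not a drop-in answer; it assumes the `timestamp` field is an ISO 8601 string (adjust the parsing to your actual format), and `iter_records` is a hypothetical helper name.

```python
import json
from datetime import datetime

def iter_records(path, start, end):
    """Yield cleaned records whose timestamp falls in [start, end).

    Assumes each line is one JSON object and that "timestamp" is an
    ISO 8601 string, e.g. "2021-01-15T00:00:00" (an assumption --
    adapt datetime.fromisoformat to your real format).
    """
    keys = ("username", "language", "key", "page_url",
            "session", "timestamp", "value")
    with open(path, encoding="utf8") as f:
        for line in f:
            try:
                d = json.loads(line)
                ts = datetime.fromisoformat(d["timestamp"])
                rec = {k: d[k] for k in keys}
            except (KeyError, ValueError, UnicodeDecodeError):
                continue  # skip malformed or incomplete lines
            if start <= ts < end:
                yield rec

# Usage: only materialize what the range actually matches.
# records = list(iter_records("//logfile",
#                             datetime(2021, 1, 1),
#                             datetime(2021, 2, 1)))
```

Because the generator yields one record at a time, memory stays bounded by the size of the matching subset, and you can consume it lazily (e.g. feed it straight into `pandas.DataFrame(iter_records(...))`) instead of building the full 25-million-element list first.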