Question

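I have a script that caches data returned from an API as JSON objects in a flat file, one result/JSON object per line.

The caching workflow is as follows:

Read in the entire cache file -> check line by line whether each line is too old -> save the ones that aren't too old to a new list -> print the new fresh cache list to file, and also use the new list as a filter so that incoming data from the API call is not processed again.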

By far the longest part of this process is the line-by-line staleness check. Here is the code:

import json
from dateutil import parser

# meta_dict, cache_timeout and fresh_cache are defined earlier in the script.

print("Reading cache file into memory ---")
with open('cache', 'r') as f:
    cache_lines = f.readlines()

print("Turning cache lines into json and checking if they are stale or not ---")
for line in cache_lines:
    # Load the line back up as a json object
    try:
        json_line = json.loads(line)
    except Exception as e:
        print(e)
        continue  # skip lines that are not valid JSON

    # Get the delta to determine if data is stale.
    delta = meta_dict["timestamp_start"] - parser.parse(json_line['timestamp_start'])

    # If the data is still fresh then hold onto it
    if cache_timeout >= delta:
        fresh_cache.append(json_line)

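This can take minutes, depending on the size of the cache file. Is there a faster way to do this? I understand that reading in the whole file first is not ideal, but it was the easiest to implement.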

Answer 1

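Depending on the size of your file, reading it all at once may cause memory problems. I don't know whether that is the kind of problem you are running into. The previous code could be rewritten like this: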

import json
from dateutil import parser

# Reference time that every cached entry is compared against
delta = meta_dict['timestamp_start']
with open('cache', 'r') as f:
    # Iterating over the file object reads one line at a time
    for line in f:
        line = json.loads(line)
        if delta - parser.parse(line['timestamp_start']) <= cache_timeout:
            fresh_cache.append(line)

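Also,

If you use dateutil to parse the dates, each call can be expensive, because parser.parse has to work out the format every time. If your format is known, you may want to use the conversion tools that datetime provides with an explicit format string instead.
If your file is really big and fresh_cache would have to be really big as well, you can use another with statement to write the fresh entries to an intermediate file.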
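The two suggestions above can be combined into one streaming pass. The following is only a minimal sketch: the function name filter_cache, the constant TIMESTAMP_FORMAT, and the timestamp layout itself are assumptions made for illustration and are not taken from the original script.

```python
import json
from datetime import datetime, timedelta

# Assumed timestamp layout; adjust it to match the real cache entries.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def filter_cache(src_path, dst_path, now, cache_timeout):
    """Copy the entries of src_path that are still fresh into dst_path.

    Lines are streamed one at a time, so neither file is ever held in
    memory as a whole. Returns the number of entries kept.
    """
    kept = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            try:
                entry = json.loads(line)
                started = datetime.strptime(entry["timestamp_start"],
                                            TIMESTAMP_FORMAT)
            except (ValueError, KeyError, TypeError):
                continue  # skip blank, malformed, or unexpected lines
            if now - started <= cache_timeout:
                # Write the original line back, ensuring a trailing newline
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept
```

A call might look like filter_cache('cache', 'cache.fresh', meta_dict['timestamp_start'], cache_timeout), where 'cache.fresh' is a hypothetical intermediate file name. Afterwards, os.replace('cache.fresh', 'cache') could swap the filtered file into place.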

Answer 2

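Reporting back: 1. The simple change had almost no effect. 2. Doing the datetime extraction manually worked great. It reduced the time from 8m11.578s to 2m55.681s. This replaced the parser.parse call in the line above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S%f")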
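To check a difference like this on your own data, a small timing harness along these lines may help. It is only a sketch: the sample timestamp is made up, the format string (with a dot before the fractional seconds) is an assumption about the layout, and time_parser is a hypothetical helper name. The dateutil measurement runs only if that third-party package is installed.

```python
import timeit
from datetime import datetime

SAMPLE = "2014-03-01 12:30:45.123456"  # made-up timestamp for illustration
FORMAT = "%Y-%m-%d %H:%M:%S.%f"        # assumed layout of the cache entries


def time_parser(parse, repeats=10000):
    """Return the seconds needed to call parse(SAMPLE) `repeats` times."""
    return timeit.timeit(lambda: parse(SAMPLE), number=repeats)


# Explicit format string: no guessing about the layout is needed
strptime_seconds = time_parser(lambda s: datetime.strptime(s, FORMAT))
print("strptime: %.3fs" % strptime_seconds)

try:
    from dateutil import parser  # third-party; only timed if available
except ImportError:
    parser = None

if parser is not None:
    print("dateutil: %.3fs" % time_parser(parser.parse))
```

The absolute numbers depend on the machine and on the timestamp layout, so only the ratio between the two measurements is meaningful.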

How can I optimize reading and processing a large file?

2 answers: