Question

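I am writing a simple program that should read a large file (263.5 GB, to be exact) with a JSON object on every line (link here). I did some research, and the best approach I found was to read the file line by line. My code looks like this (full code here):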

with open(dumpLocation, "r") as f:
    for line in f:

        # Read line, convert to dictionary and assign it to 'c'
        c = json.loads(f.readline())

        for n in files:
            if n.lower() in c["title"].lower():

                try:
                    # Collect data
                    timestamp = str(c["retrieved_on"])
                    sr_id = c["subreddit_id"]
                    score = str(c["score"])
                    ups = str(c["ups"])
                    downs = str(c["downs"])
                    title = ('"' + c["title"] + '"')

                    # Append data to file
                    files[n].write(timestamp + ","
                                   + sr_id + ","
                                   + score + ","
                                   + ups + ","
                                   + downs + ","
                                   + title + ","
                                   + "\n")
                    found += 1
                except:
                    numberOfErrors += 1
                    errors[comments] = sys.exc_info()[0]

            comments += 1

            # Updates user
            print("Comments scanned: " + str(comments) + "\nFound: " + str(found) + "\n")

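I can get it to run, and it ran for about an hour (roughly 1.3 million lines) before it crashed. I noticed that memory usage grew slowly while the process was running and reached about 2 GB just before the crash.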

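I need to go through about 200 million lines, and I also write to a file whenever a specific word is found (I am searching for 5 words and had found 337 matches before the crash). Is there a better way to do this? My computer usually has only about 2 GB of RAM to spare.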

Answer 1

There is a memory leak here:

except:
    numberOfErrors += 1
    errors[comments] = sys.exc_info()[0]

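Every failing line adds another entry to the errors dictionary, and nothing ever removes them. Because the number of input lines is huge, the number of errors can be huge as well, especially if there is a bug in the algorithm.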

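A bare except is evil because it hides every error, including plain programming mistakes in your own code such as a misspelled variable name. You should handle only the specific exception types that you expect real data to raise, and keep the try-except block as small as possible.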
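
As an illustration of that advice, here is a minimal sketch, not the asker's actual code. It assumes each line holds one JSON object with the fields used in the question; the helper names extract_fields and iter_rows are hypothetical. Only json.JSONDecodeError and KeyError are caught, the try block wraps a single call, and errors are tallied per exception type in a Counter instead of being stored once per line, so the bookkeeping stays small no matter how many lines fail.

```python
import json
from collections import Counter


def extract_fields(line):
    """Parse one JSON line and return the CSV fields as strings.

    Hypothetical helper: raises json.JSONDecodeError for malformed JSON
    and KeyError when an expected field is missing.
    """
    record = json.loads(line)
    return [
        str(record["retrieved_on"]),
        record["subreddit_id"],
        str(record["score"]),
        str(record["ups"]),
        str(record["downs"]),
        '"' + record["title"] + '"',
    ]


def iter_rows(lines, error_counts):
    """Yield field lists one at a time; count expected failures by type."""
    for line in lines:
        try:
            # Keep the try block narrow: only the call that can fail on bad data.
            fields = extract_fields(line)
        except (json.JSONDecodeError, KeyError) as exc:
            # One counter per exception type, not one dict entry per line.
            error_counts[type(exc).__name__] += 1
            continue
        yield fields


if __name__ == "__main__":
    sample = [
        '{"retrieved_on": 1, "subreddit_id": "t5_x", "score": 2,'
        ' "ups": 3, "downs": 0, "title": "hi"}',
        "not json",
    ]
    counts = Counter()
    for row in iter_rows(sample, counts):
        print(",".join(row))
    print(dict(counts))
```

Any other exception (a TypeError from an unexpected value, a NameError from a typo) still propagates, so real bugs stay visible. The sketch also parses the loop variable directly; as far as I can tell, calling f.readline() inside "for line in f" reads an extra line on each iteration in Python 3, so the loop in the question would process only every other line.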

Answer 2

I found where the memory leak was. It was this line, where I print to the console after every line:

 print("Comments scanned: " + str(comments) + "\nFound: " + str(found) + "\n")

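If you print 200 million times, your computer is bound to run out of memory trying to keep all of that output in the console at once. I removed the line and it works perfectly :)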
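
If some progress output is still wanted, one option is to print only occasionally rather than on every line. The following is a minimal sketch under that assumption; report_progress and its every parameter are hypothetical names, and the interval of 100,000 is an arbitrary choice.

```python
import sys


def report_progress(comments, found, every=100_000, out=sys.stderr):
    """Write a one-line status only when `comments` is a multiple of `every`.

    Returns True if a status line was written, False otherwise.
    """
    if comments % every != 0:
        return False
    print(f"Comments scanned: {comments}  Found: {found}", file=out)
    return True


if __name__ == "__main__":
    found = 0
    for comments in range(1, 250_001):
        report_progress(comments, found)  # writes at 100000 and 200000 only
```

With the default interval, 200 million lines would produce about 2,000 status lines instead of 200 million.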

Reading a large text file line by line still uses all my memory
