Question

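I want to process many mp3 files in a loop using a Jupyter Notebook on Kaggle. However, reading an mp3 file in binary mode seems to keep the file in memory, even after the function has returned and the file has been properly closed. As a result, memory usage grows with every file processed. The problem appears to lie in the read() call, because replacing it with pass causes no growth in memory usage.

While iterating over the mp3 files, memory usage grows by the size of the file being processed, which suggests that the file is being kept in memory.

How can I read a file without it being kept in memory after the function returns?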

def read_mp3_as_bin(fname):
    with open(fname, "rb") as f:
        data = f.read() # when using 'pass' memory usage doesn't grow
    print(f.closed)
    return

for fname in file_names: # file_names are 25K paths to the mp3 files
    read_mp3_as_bin(fname)

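“Solution”

I ran this code locally, and memory usage did not grow at all. So it looks like Kaggle handles files differently, since that is the only variable in this test. I will try to find out why this code behaves differently on Kaggle and will let you know when I learn more.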

Answer 1

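I'm pretty sure you are measuring memory usage incorrectly.

I created 3 dummy files of 50 MB each and ran your code on them, printing the memory usage inside and outside the function on every loop iteration. The results are consistent with the memory being freed after the file is closed and the function returns.

To measure memory usage, I used the solution suggested here, and to create the dummy files I simply ran truncate -s 50M test_1.txt, as suggested in this blog post.

Take a look: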
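For readers without the truncate command (for example on Windows), the dummy files can also be created from Python. The following is only a minimal sketch: the helper name make_dummy_file and the temporary directory are assumptions made for illustration and are not part of the original answer.

```python
import os
import tempfile


def make_dummy_file(path, size_bytes):
    """Create a file of exactly size_bytes (zero-filled on most platforms)."""
    with open(path, "wb") as f:
        # Extending a file with truncate() mirrors `truncate -s` on Linux.
        f.truncate(size_bytes)


SIZE = 50 * 1024 * 1024  # 50 MiB, matching `truncate -s 50M`

# Hypothetical location: a fresh temporary directory instead of the working directory.
tmp_dir = tempfile.mkdtemp()
file_names = [os.path.join(tmp_dir, f"test_{i}.txt") for i in range(1, 4)]

for name in file_names:
    make_dummy_file(name, SIZE)
    print(name, os.path.getsize(name))
```

The resulting file_names list can be passed to the loop in place of the hard-coded names.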

import os
import psutil


def read_mp3_as_bin(fname):
    with open(fname, "rb") as f:
        data = f.read()  # when using 'pass' memory usage doesn't grow
    if data:
        print("read data")

    process = psutil.Process(os.getpid())
    print(f"inside the function, it is using {process.memory_info().rss / 1024 / 1024} MB")  # in Megabytes
    return


file_names = ['test_1.txt', 'test_2.txt', 'test_3.txt']

for fname in file_names:  # three 50 MB dummy files stand in for the mp3 paths
    read_mp3_as_bin(fname)
    process = psutil.Process(os.getpid())
    print(f"outside the function, it is using {process.memory_info().rss / 1024 / 1024} MB")  # in Megabytes

Output:

read data
inside the function, it is using 61.77734375 MB
outside the function, it is using 11.91015625 MB
read data
inside the function, it is using 61.6640625 MB
outside the function, it is using 11.9140625 MB
read data
inside the function, it is using 61.66796875 MB
outside the function, it is using 11.91796875 MB
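As a complementary check to the RSS figures above, the standard library's tracemalloc module can report how much memory allocated by Python is still live after each call. This is a sketch under stated assumptions, not what the answer actually ran: the function name read_as_bin, the 5 MB file size, and the dummy_*.bin paths are hypothetical. Note that tracemalloc only sees allocations made through Python's allocators, so it would not reveal caching done by the operating system or the notebook environment.

```python
import os
import tempfile
import tracemalloc


def read_as_bin(fname):
    """Read the whole file into memory and return only its length."""
    with open(fname, "rb") as f:
        data = f.read()
    return len(data)  # `data` becomes unreachable once the function returns


# Hypothetical small test files (5 MB each) in a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp_dir, f"dummy_{i}.bin")
    with open(p, "wb") as f:
        f.write(b"\0" * (5 * 1024 * 1024))
    paths.append(p)

tracemalloc.start()
retained = []  # bytes still allocated after each call
for p in paths:
    read_as_bin(p)
    current, peak = tracemalloc.get_traced_memory()
    retained.append(current)
    print(f"{os.path.basename(p)}: still allocated {current / 1024 / 1024:.2f} MB, "
          f"peak {peak / 1024 / 1024:.2f} MB")
tracemalloc.stop()
```

If the "still allocated" figure stays small while the peak is roughly the file size, the buffer is being released on the Python side, and any remaining growth would have to come from outside the interpreter.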
