I'm trying to create a list of file names from a large tarball, and I'd like to understand why the memory usage in both examples below is the same. Is it because f.write() keeps/buffers all the objects in memory until the file is actually closed? Is there a way to improve this?
# touch file{1..100000}.txt
# tar cf test.tar file*
Generator:
# python test.py
Memory (Before): 40.918MB
Memory (After): 117.066MB
It took 12.636950492858887 seconds.
List:
# python test.py
Memory (Before): 40.918MB
Memory (After): 117.832MB
It took 12.049121856689453 seconds.
test.py
#!/usr/bin/python3
import memory_profiler
import tarfile
import time

def files_generator(tar):
    # Lazily yield one member name at a time via TarFile.next().
    entry = tar.next()
    while entry:
        yield entry.name
        entry = tar.next()

def files_list(tar):
    # Read every member name into a list in one call.
    return tar.getnames()

if __name__ == '__main__':
    print(f'Memory (Before): {memory_profiler.memory_usage()[0]:.3f}MB')
    start = time.time()
    tar = tarfile.open('test.tar')
    with open('output_g.txt', 'w') as f:
        for i in files_generator(tar):
        # for i in files_list(tar):
            f.write(i + '\n')
    end = time.time()
    print(f'Memory (After): {memory_profiler.memory_usage()[0]:.3f}MB')
    print(f'It took {end - start} seconds.')
Answer 0 (score: 2)
The TarFile.next() method caches its contents, including the lines:

if tarinfo is not None:
    self.members.append(tarinfo)
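
You can watch this cache grow directly (a quick check of my own, assuming the test.tar created above): after draining the archive with next(), every TarInfo is still held on the TarFile object:

import tarfile

tar = tarfile.open('test.tar')
count = 0
entry = tar.next()
while entry:
    count += 1
    entry = tar.next()
# next() appended each TarInfo to the internal cache as it went.
print(len(tar.members), count)  # both equal the number of archive entries
tar.close()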
It turns out that TarFile.getnames() calls TarFile.getmembers(), which calls TarFile._load(), which in turn calls TarFile.next() repeatedly until every entry has been read into self.members. As a result, TarFile.getnames() and iterating via TarFile.next() end up with the same memory usage.
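
As for the "is there a way to improve it" part of the question: f.write() is not the culprit, the cached TarInfo objects are. A minimal sketch of a workaround (my own, not from the answer above; it clears the internal members cache, which is an implementation detail and breaks later getmembers()/getnames() calls on the same TarFile):

import tarfile

def files_generator_nocache(tar):
    # Yield one name at a time and immediately discard the TarInfo that
    # next() just cached, so memory stays flat over the whole archive.
    entry = tar.next()
    while entry:
        yield entry.name
        tar.members = []  # drop the internal cache (implementation detail)
        entry = tar.next()

with tarfile.open('test.tar') as tar, open('output_g.txt', 'w') as f:
    for name in files_generator_nocache(tar):
        f.write(name + '\n')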