
Question

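I'm trying to run some scripts that analyse data in Python, and I quickly realized just how much RAM they need:

My script reads two columns of integers from a file. It imports them in the following way: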

import numpy as N
from sys import argv
infile = argv[1]
data = N.loadtxt(infile, dtype=N.int32)  # infile is the input file

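For a file with nearly 8 million lines, it needs about 1.5 GB of RAM (at this stage it is only importing the data).

I tried running a memory profiler on it, which gave me:

Line #    Mem usage    Increment   Line Contents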

 5   17.664 MiB    0.000 MiB   @profile
 6                             def func():
 7   17.668 MiB    0.004 MiB    infile = argv[1]
 8  258.980 MiB  241.312 MiB    data = N.loadtxt(infile,dtype=N.int32)

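So the data takes about 250 MB, far from the 1.5 GB in memory (what is occupying so much space?).

When I tried to halve it by using int16 instead of int32:

Line #    Mem usage    Increment   Line Contents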

 5   17.664 MiB    0.000 MiB   @profile
 6                             def func():
 7   17.668 MiB    0.004 MiB    infile = argv[1]
 8  229.387 MiB  211.719 MiB    data = N.loadtxt(infile,dtype=N.int16)

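But I only saved about a tenth. Why is that?

I don't know much about memory usage, but is this normal?

Also, I coded the same thing in C++, storing the data in vector<int> objects, and it only needs 120 MB of RAM.

To me, Python seems to take up a lot of room when handling memory. What is it doing that inflates the weight of the data? Is it related to NumPy?

Inspired by the answer below, I now import my data in the following way: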

import subprocess

infile = argv[1]
# Use the Linux wc command to count the lines in the file, so I know how much memory to allocate
output = subprocess.getoutput("wc -l " + infile)
n_lines = int(output.split()[0])  # the first field is the number of lines
data = N.empty((n_lines, 2), dtype=N.int16)  # allocate the array
with open(infile) as datafile:
    for count, line in enumerate(datafile):  # read line by line
        data[count] = line.split()  # fill the array

It works in much the same way with multiple files:

infiles = argv[1:]
n_lines = sum(int(subprocess.getoutput("wc -l " + infile).split()[0]) for infile in infiles)
i = 0
data = N.empty((n_lines, 2), dtype=N.int16)
for infile in infiles:
    with open(infile) as datafile:
        for line in datafile:
            data[i] = line.split()
            i += 1

The culprit seems to have been numpy.loadtxt. After removing it, my script no longer needs a huge amount of memory, and it even runs 2-3 times faster =)

Answer 1

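The loadtxt() method is not memory efficient because it uses a Python list to temporarily store the file contents. Here is a short explanation of why a Python list takes up so much space.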
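
To make the overhead concrete, here is a rough sketch comparing a plain list of Python integers with a NumPy array holding the same values. The sizes `n_rows` and `n_cols` are arbitrary illustrative values, and the exact byte count of the list depends on the Python build, so no specific figure is claimed for it.

```python
import sys
import numpy as np

n_rows, n_cols = 100_000, 2  # illustrative size, far smaller than the file in the question

# Values above 256 so that CPython's small-integer cache does not share objects.
values = list(range(1_000, 1_000 + n_rows * n_cols))

# A list stores one pointer per element, and each element is a separate int object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A NumPy array stores the raw values contiguously: 4 bytes per int32.
arr = np.array(values, dtype=np.int32).reshape(n_rows, n_cols)

print("list of ints:", list_bytes, "bytes")
print("int32 array: ", arr.nbytes, "bytes")
```

By the same arithmetic, 8 million rows of two int32 values come to 8,000,000 x 2 x 4 bytes = 64 MB of raw data.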

One solution is to write your own implementation for reading the text file, like this:

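buffsize = 10000  # Increase this for large files
ncols = 2  # Number of columns (two in the question)
data = N.empty((buffsize, ncols))  # Init array with buffsize rows
dataFile = open(infile)

count = -1  # So that an empty file yields an empty array
for count, line in enumerate(dataFile):
   if count >= len(data):
       # Grow the array in place by another buffsize rows
       data.resize((count + buffsize, ncols), refcheck=False)
   # Convert line into values (assumes whitespace-separated integers)
   line_values = [int(v) for v in line.split()]
   data[count] = line_values

# Fix array size
data.resize((count+1, ncols), refcheck=False)
dataFile.close()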

Since we sometimes can't count the lines in advance, I defined a kind of buffering to avoid resizing the array all the time.
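
As an alternative sketch for the case where the row count is unknown, `numpy.fromiter` can build a one-dimensional array directly from a generator without being told the length in advance, and the result can then be reshaped. This is a different technique from the manual buffer above, not the method of this answer; the helper name `read_int_table`, the whitespace-separated format, and the sample data are assumptions made for illustration.

```python
import io
import numpy as np

def read_int_table(lines, ncols, dtype=np.int32):
    """Read whitespace-separated integers from an iterable of lines into a 2D array."""
    # The generator yields one value at a time, so no intermediate list of rows is built.
    values = (int(token) for line in lines for token in line.split())
    flat = np.fromiter(values, dtype=dtype)
    # Assumes every line holds exactly ncols values; reshape fails if the
    # total number of values is not a multiple of ncols.
    return flat.reshape(-1, ncols)

# A small in-memory stand-in for the input file (hypothetical data).
sample = io.StringIO("1 2\n3 4\n5 6\n")
data = read_int_table(sample, ncols=2)
print(data.shape)
```

With a real file, an open file object can be passed in place of `sample`, since iterating over it also yields one line at a time.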

Note: At first, I came up with a solution using numpy.append. But as pointed out in the comments, append is also inefficient because it copies the array contents.
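
The copying behaviour mentioned in the note can be checked with a small sketch; the array shapes and values here are arbitrary and purely illustrative.

```python
import numpy as np

a = np.zeros((3, 2), dtype=np.int16)
b = np.append(a, [[1, 2]], axis=0)  # returns a new array; a is left unchanged

print(a.shape, b.shape)
print(np.shares_memory(a, b))  # whether the two arrays share any memory
```

Calling this once per line would therefore copy the whole array for every row read, which is what the buffered resize avoids.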

Python and a memory-efficient way of importing 2D data
