I have a huge file (about 30GB); each line contains the coordinates of one point on a 2D surface. I need to load the file into a Numpy array: points = np.empty((0, 2)), and apply scipy.spatial.ConvexHull to it. Since the file is so large, I can't load it into memory all at once; I want to load it in batches of N lines, apply scipy.spatial.ConvexHull to each small part, and then load the next N lines. What is an efficient way to do this?
I found out that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object that yields the lines of the file one by one and is meant to be used in a loop, so I don't know how to convert lines_gen into a Numpy array efficiently:
from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)
My input file:
0.989703 1
0 0
0.0102975 0
0.0102975 0
1 1
0.989703 1
1 1
0 0
0.0102975 0
0.989703 1
0.979405 1
0 0
0.020595 0
0.020595 0
1 1
0.979405 1
1 1
0 0
0.020595 0
0.979405 1
0.969108 1
...
...
...
0 0
0.0308924 0
0.0308924 0
1 1
0.969108 1
1 1
0 0
0.0308924 0
0.969108 1
0.95881 1
0 0
Answer 0 (score: 4)
Based on your data, I can read it in 5-line chunks (N = 5) like this:
In [182]: with open(input,'r') as infile:
   .....:     while True:
   .....:         gen = islice(infile, N)
   .....:         arr = np.genfromtxt(gen, dtype=None)
   .....:         print arr
   .....:         if arr.shape[0] < N:
   .....:             break
   .....:
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]
The same thing read as one single chunk:
In [183]: with open(input,'r') as infile:
   .....:     arr = np.genfromtxt(infile, dtype=None)
   .....:
In [184]: arr
Out[184]:
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
(0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
(0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
(0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
(0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
(0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
(0.95881, 1), (0.0, 0)],
dtype=[('f0', '<f8'), ('f1', '<i4')])
(This is in Python 2.7; in Python 3 I would need to work around a bytes/string issue.)
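If the goal is the hull of the whole file rather than per-chunk hulls, the chunks can be folded together: the convex hull of everything seen so far is fully determined by its vertices, so only those need to be carried from one chunk to the next. A minimal sketch of that idea (hull_of_file is a hypothetical name; it assumes well-formed two-column lines and enough non-collinear points per chunk for ConvexHull to succeed):
import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

def hull_of_file(fname, N=1000000):
    hull_pts = np.empty((0, 2))  # vertices of the hull seen so far
    with open(fname) as infile:
        while True:
            lines = list(islice(infile, N))
            if not lines:
                break
            chunk = np.genfromtxt(lines).reshape(-1, 2)
            # hull(old points + new points) == hull(old hull vertices + new points)
            pts = np.vstack([hull_pts, chunk])
            hull_pts = pts[ConvexHull(pts).vertices]
    return ConvexHull(hull_pts)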
Answer 1 (score: 1)
You can try the second method from this post: read the file in chunks by using a precomputed array of line offsets (if it fits in memory) to seek to any given line. Here is an example of what I usually use to avoid loading the whole file into memory:
data_file = open("data_file.txt", "rb")
line_offset = []
offset = 0
while True:
    lines = data_file.readlines(100000)
    if not lines:
        break
    for line in lines:
        line_offset.append(offset)
        offset += len(line)

# reading a line
line_to_read = 1
data_file.seek(line_offset[line_to_read])
line = data_file.readline()
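Once that offset index is built, any block of N consecutive lines can be seeked to and parsed directly. A minimal sketch of that step (read_chunk and its parameters are my own names, and it assumes start + N does not run past the end of the file):
import numpy as np

def read_chunk(data_file, line_offset, start, N):
    # jump straight to the first requested line, then parse the next N lines
    data_file.seek(line_offset[start])
    rows = [[float(v) for v in data_file.readline().split()]
            for _ in range(N)]
    return np.array(rows)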
Answer 2 (score: 1)
You can define a chunk reader as follows, using a generator:
def read_file_chunk(fname, chunksize=500000):
    with open(fname, 'r') as myfile:
        lines = []
        for i, line in enumerate(myfile):
            line_values = tuple(float(val) for val in line.split())
            lines.append(line_values)
            if (i + 1) % chunksize == 0:
                yield lines
                lines = []  # reset the lines list
        if lines:
            yield lines  # final few lines of the file
# and, assuming the function you want to apply is called `my_func`
chunk_gen = read_file_chunk(my_file_name)
for chunk in chunk_gen:
my_func(chunk)
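Each chunk is then a plain list of (x, y) tuples, so my_func only has to wrap it in an array before working on it. For example, a sketch of one possible my_func (not part of the original answer) that hulls each chunk on its own:
import numpy as np
from scipy.spatial import ConvexHull

def my_func(chunk):
    pts = np.array(chunk)        # chunk is a list of (x, y) tuples
    hull = ConvexHull(pts)       # hull of this chunk only
    return pts[hull.vertices]    # keep just this chunk's hull vertices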
Answer 3 (score: 0)
You can take a look at DAGpype's chunk_stream_bytes. I haven't used it myself, but I hope it helps.
Here is an example of chunked reading and processing of a .csv file (_f_name):
np.chunk_stream_bytes(_f_name, num_cols = 2) | \
filt(lambda a : a[logical_and(a[:, 0] < 10, a[:, 1] < 10), :]) | \
np.corr()