I have a huge file (about 30GB); each line contains the coordinates of one point on a 2D surface. I need to load the file into a Numpy array: points = np.empty((0, 2)), and apply scipy.spatial.ConvexHull to it. Since the file is so large, I can't load it into memory all at once; I want to load it in batches of N lines, apply scipy.spatial.ConvexHull to each small part, and then load the next N lines. What is an efficient way to do this?
I found out that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object that yields the lines of the file one by one and is meant to be used in a loop, so I don't know how to convert lines_gen into a Numpy array efficiently:
from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)
My input file:
0.989703 1
0 0
0.0102975 0
0.0102975 0
1 1
0.989703 1
1 1
0 0
0.0102975 0
0.989703 1
0.979405 1
0 0
0.020595 0
0.020595 0
1 1
0.979405 1
1 1
0 0
0.020595 0
0.979405 1
0.969108 1
...
...
...
0 0
0.0308924 0
0.0308924 0
1 1
0.969108 1
1 1
0 0
0.0308924 0
0.969108 1
0.95881 1
0 0
Answer 0 (score: 4)
Based on your data, I can read it in 5-line chunks (N = 5) like this:
In [182]: with open(input,'r') as infile:
   .....:     while True:
   .....:         gen = islice(infile, N)
   .....:         arr = np.genfromtxt(gen, dtype=None)
   .....:         print arr
   .....:         if arr.shape[0] < N:
   .....:             break
   .....:
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]
The same thing read as one single chunk:
In [183]: with open(input,'r') as infile:
   .....:     arr = np.genfromtxt(infile, dtype=None)
   .....:
In [184]: arr
Out[184]:
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
(0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
(0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
(0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
(0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
(0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
(0.95881, 1), (0.0, 0)],
dtype=[('f0', '<f8'), ('f1', '<i4')])
(This is in Python 2.7; in Python 3 I would need to work around a bytes/string issue.)
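If the goal is the hull of the whole file rather than per-chunk hulls, the chunks can be folded together: the convex hull of everything seen so far is fully determined by its vertices, so only those need to be carried from one chunk to the next. A minimal sketch of that idea (hull_of_file is a hypothetical name; it assumes well-formed two-column lines and enough non-collinear points per chunk for ConvexHull to succeed):
import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

def hull_of_file(fname, N=1000000):
    hull_pts = np.empty((0, 2))  # vertices of the hull seen so far
    with open(fname) as infile:
        while True:
            lines = list(islice(infile, N))
            if not lines:
                break
            chunk = np.genfromtxt(lines).reshape(-1, 2)
            # hull(old points + new points) == hull(old hull vertices + new points)
            pts = np.vstack([hull_pts, chunk])
            hull_pts = pts[ConvexHull(pts).vertices]
    return ConvexHull(hull_pts)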
Answer 1 (score: 1)
You can try the second method from this post: read the file in chunks by using a precomputed array of line offsets (if it fits in memory) to seek to any given line. Here is an example of what I usually use to avoid loading the whole file into memory:
data_file = open("data_file.txt", "rb")
line_offset = []
offset = 0
while True:
    lines = data_file.readlines(100000)
    if not lines:
        break
    for line in lines:
        line_offset.append(offset)
        offset += len(line)

# reading a line
line_to_read = 1
data_file.seek(line_offset[line_to_read])
line = data_file.readline()
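Once that offset index is built, any block of N consecutive lines can be seeked to and parsed directly. A minimal sketch of that step (read_chunk and its parameters are my own names, and it assumes start + N does not run past the end of the file):
import numpy as np

def read_chunk(data_file, line_offset, start, N):
    # jump straight to the first requested line, then parse the next N lines
    data_file.seek(line_offset[start])
    rows = [[float(v) for v in data_file.readline().split()]
            for _ in range(N)]
    return np.array(rows)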
Answer 2 (score: 1)
You can define a chunk reader as follows, using a generator:
def read_file_chunk(fname, chunksize=500000):
    with open(fname, 'r') as myfile:
        lines = []
        for i, line in enumerate(myfile):
            line_values = tuple(float(val) for val in line.split())
            lines.append(line_values)
            if (i + 1) % chunksize == 0:
                yield lines
                lines = []  # reset the lines list
        if lines:
            yield lines  # final few lines of the file
# and, assuming the function you want to apply is called `my_func`
chunk_gen = read_file_chunk(my_file_name)
for chunk in chunk_gen:
my_func(chunk)
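Each chunk is then a plain list of (x, y) tuples, so my_func only has to wrap it in an array before working on it. For example, a sketch of one possible my_func (not part of the original answer) that hulls each chunk on its own:
import numpy as np
from scipy.spatial import ConvexHull

def my_func(chunk):
    pts = np.array(chunk)        # chunk is a list of (x, y) tuples
    hull = ConvexHull(pts)       # hull of this chunk only
    return pts[hull.vertices]    # keep just this chunk's hull vertices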
Answer 3 (score: 0)
You can take a look at DAGpype's chunk_stream_bytes. I haven't used it myself, but I hope it helps.
Here is an example of chunked reading and processing of a .csv file (_f_name):
np.chunk_stream_bytes(_f_name, num_cols = 2) | \
filt(lambda a : a[logical_and(a[:, 0] < 10, a[:, 1] < 10), :]) | \
np.corr()