Quickly convert a very large number of lists of strings into an ndarray

Date: 2017-08-05 11:31:11

Tags: python arrays performance numpy file-io

My file looks as follows, where the first 3 numbers on each line denote a triangle/triplet, and the 4th number is a marker for that triangle:

1 2 3 1
5 6 7 0
300 10 11 5
0 14 15 9

I currently read it as follows:

import numpy as np

# read all non-comment lines; a context manager closes the file,
# and avoiding the name 'file' keeps the builtin intact
with open(fname, 'r') as f:
    lines = [x for x in f if not x.startswith('#')]

n = ...  # number of lines to read
tri = np.empty([n, 3], dtype=int)    # array of triplets
tri_mark = np.empty([n], dtype=int)  # a marker for each triplet
for i in range(n):
    s = lines[i].split()
    tri[i, :] = [int(v) for v in s[:-1]]
    tri_mark[i] = int(s[-1])

When the number of lines runs into the millions, the for loop turns out to be a severe bottleneck. An external program that I also use can read the file very quickly, so I think it should be possible to read and convert it much faster.

Is there a faster way to convert the list of strings into an ndarray?

(Switching to a binary file is currently not an option.)

1 Answer:

Answer 0 (score: 3)

Use np.loadtxt to read in the whole file:

>>> import numpy as np
>>> arr = np.loadtxt(fname, dtype=int)
>>> arr
array([[  1,   2,   3,   1],
       [  5,   6,   7,   0],
       [300,  10,  11,   5],
       [  0,  14,  15,   9]])
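By default, np.loadtxt treats lines starting with # as comments and skips them (its comments parameter defaults to '#'), so the manual filtering from the question is not needed here.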

and then slicing to get the appropriate subarrays:

>>> tri = arr[:, 0:3]
>>> tri
array([[  1,   2,   3],
       [  5,   6,   7],
       [300,  10,  11],
       [  0,  14,  15]])
>>> tri_mark = arr[:, 3]
>>> tri_mark
array([1, 0, 5, 9])
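Both slices are views into arr, so no data is copied.

If np.loadtxt is still too slow at millions of lines, one common alternative (not part of the original answer) is to let pandas do the parsing with its C-based reader and then hand the result to NumPy. A minimal sketch, assuming pandas is available and the file is whitespace-delimited as above:

>>> import pandas as pd
>>> # comment='#' skips #-prefixed lines; header=None keeps every row as data
>>> df = pd.read_csv(fname, sep=r'\s+', comment='#', header=None, dtype=int)
>>> arr = df.to_numpy()
>>> tri, tri_mark = arr[:, 0:3], arr[:, 3]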