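How to read different parts or blocks of a file with numpy

Date: 2016-02-12 17:41:19

Tags: python numpy

I have a file made up of 'blocks' of data, with a header that says how many blocks the file contains and how many rows are in each block.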

# mydata.dat
3 12343 2
# comment
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8

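I want to store each block separately. What I do now is build a list from each block by splitting its lines, convert that list to an array, and store it in a dictionary.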

import numpy as np

with open('mydata.dat', "r") as f:
    lines = f.readlines()
    blocks, _, n = map(int, lines[1].split())
    del lines[:3]
# each block is one header line followed by n data rows
data = {i: np.array(dtype=float,
                    object=[line.split()
                            for line in lines[i * (n + 1) + 1: (i + 1) * (n + 1)]])
        for i in range(blocks)}

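I feel there should be a better (more efficient) way to parse the data directly from the text blocks, something like np.loadtxt where you could skip lines periodically (like a slice) rather than only from the beginning of the file.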

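3 answers:

Answer 0 (score: 1)

loadtxt does the same thing you are doing for each block: it reads each line, splits it, converts the fields according to dtype, appends the result to a list, and converts that list to an array at the end.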

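As other questions have pointed out, you can pass loadtxt an already-open file or any iterable of lines. So you can preprocess the file, divide it into blocks, skip lines, and so on. Overall, though, it won't be more efficient than what you are already doing.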
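As a minimal sketch of that point, the snippet below feeds the same two rows to np.loadtxt as a list of strings, as a generator, and as a file-like object. The row values and the variable names are made up for illustration.

```python
import io
import numpy as np

# Two illustrative rows of text (hypothetical values)
rows = ["1 2 3 4\n", "5 6 7 8\n"]

# np.loadtxt accepts a list of strings ...
from_list = np.loadtxt(rows, dtype=float)
# ... a generator yielding strings ...
from_gen = np.loadtxt((r for r in rows), dtype=float)
# ... or a file-like object
from_file = np.loadtxt(io.StringIO("".join(rows)), dtype=float)

print(from_list.shape)
```

All three calls should produce the same array, so any preprocessing that yields lines can sit in front of loadtxt.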

So this might work (I haven't tested it):

data = {i: np.loadtxt(lines[i * (n + 1) + 1: (i + 1) * (n + 1)], dtype=float)
        for i in range(blocks)}

It's more compact, but I doubt it's any faster.

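The only other approach I can think of is to strip out all the block-size lines, pass the remaining lines to loadtxt to get one array with all the data, and then split that array into blocks, for example with np.split(...).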
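The np.split idea could look roughly like the sketch below. It assumes the layout from the question (comment lines start with '#', and each block is one header line followed by n data rows). The function name read_blocks and the in-memory sample are illustrative and not part of the original answer.

```python
import io
import numpy as np

# In-memory copy of the sample file from the question
sample = io.StringIO(
    "# mydata.dat\n"
    "3 12343 2\n"
    "# comment\n"
    "12343\n"
    "1 2 3 4 5 6 7 8\n"
    "1 2 3 4 5 6 7 8\n"
    "12343\n"
    "1 2 3 4 5 6 7 8\n"
    "1 2 3 4 5 6 7 8\n"
    "12343\n"
    "1 2 3 4 5 6 7 8\n"
    "1 2 3 4 5 6 7 8\n"
)

def read_blocks(fileobj):
    # Drop comment lines, then parse the header: block count, unused, rows per block
    lines = [ln for ln in fileobj if not ln.startswith('#')]
    blocks, _, n = map(int, lines[0].split())
    body = lines[1:]
    # Each block is 1 header line + n rows, so every (n + 1)-th line is a block header
    rows = [ln for k, ln in enumerate(body) if k % (n + 1) != 0]
    # Parse all data rows in one call, then cut the array into equal blocks
    flat = np.loadtxt(rows, dtype=float, ndmin=2)
    return np.split(flat, blocks)

block_list = read_blocks(sample)
print(len(block_list), block_list[0].shape)
```

This calls loadtxt once instead of once per block. Whether that is actually faster for a given file would need to be measured.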

With txt as the list of lines in the sample:

In [396]: timeit np.array([line.split() for line in txt[4:6]],dtype=float)
100000 loops, best of 3: 13 µs per loop
In [397]: timeit np.loadtxt(txt[4:6],dtype=float)
10000 loops, best of 3: 71.4 µs per loop
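To repeat this comparison outside IPython, a sketch using the standard timeit module might look like the following. Here txt holds only the two data rows of one block, and the helper names with_split and with_loadtxt are illustrative. Absolute timings depend on the machine and the NumPy version, so no expected numbers are shown.

```python
import timeit
import numpy as np

# The two data rows of one block from the sample file
txt = ["1 2 3 4 5 6 7 8\n", "1 2 3 4 5 6 7 8\n"]

def with_split():
    # Manual approach: split each line, let np.array convert the strings
    return np.array([line.split() for line in txt], dtype=float)

def with_loadtxt():
    # Same rows handed to np.loadtxt
    return np.loadtxt(txt, dtype=float)

number = 1000
t_split = timeit.timeit(with_split, number=number) / number
t_loadtxt = timeit.timeit(with_loadtxt, number=number) / number
print("split + np.array: %.1f us per call" % (t_split * 1e6))
print("np.loadtxt:       %.1f us per call" % (t_loadtxt * 1e6))
```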

Answer 1 (score: 1)

np.loadtxt() can take an iterable, so you can pass it slices of the list of lines. In the list built below (comment lines removed), the first row of data is at index 2.

import numpy as np

with open('mydata.dat', "r") as f:
    # load data, skipping comment lines
    line = [s for s in f if not s.startswith('#')]

    # parse first line to find out block size
    _, _, blocksize = map(int, line[0].split())

    # use np.loadtxt() to convert slices of the input
    data = [np.loadtxt(line[i:i+blocksize])
            for i in range(2, len(line), blocksize+1)]

You can skip loading the whole file into a list first by using itertools.islice:

import itertools as it
import numpy as np

with open('mydata.dat', "r") as f:
    # iterator over lines in f with comment lines removed
    lines = (line for line in f if not line.startswith('#'))

    # parse block structure
    nblks, _, blksz = map(int, next(lines).split())

    # convert "islice"s of the input file to np.arrays
    # start arg to islice is 1 to skip over block header line
    data = [np.loadtxt(it.islice(lines, 1, blksz + 1)) for i in range(nblks)]
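To try the islice version without a file on disk, the sketch below wraps it in a function and runs it on an in-memory copy of the sample from the question. The function name read_blocks, the use of io.StringIO, and the final np.stack step are illustrative additions.

```python
import io
import itertools as it
import numpy as np

# In-memory copy of the sample file from the question
sample = io.StringIO(
    "# mydata.dat\n"
    "3 12343 2\n"
    "# comment\n"
    "12343\n"
    "1 2 3 4 5 6 7 8\n"
    "1 2 3 4 5 6 7 8\n"
    "12343\n"
    "1 2 3 4 5 6 7 8\n"
    "1 2 3 4 5 6 7 8\n"
    "12343\n"
    "1 2 3 4 5 6 7 8\n"
    "1 2 3 4 5 6 7 8\n"
)

def read_blocks(f):
    # Lazily drop comment lines
    lines = (line for line in f if not line.startswith('#'))
    # Header line: number of blocks, unused value, rows per block
    nblks, _, blksz = map(int, next(lines).split())
    # For each block: skip its header line, then parse the next blksz rows
    return [np.loadtxt(it.islice(lines, 1, blksz + 1), ndmin=2)
            for _ in range(nblks)]

data = read_blocks(sample)
# Optional: combine equally sized blocks into one 3-D array
stacked = np.stack(data)
print(stacked.shape)
```

Passing ndmin=2 keeps every block two-dimensional even when a block has a single row.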

Answer 2 (score: 0)

np.loadtxt accepts any iterable of strings and splits each one on whitespace to build the data array.

The following code gives np.loadtxt an iterable that yields full blocks, not lines.

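import numpy as np
from itertools import islice

def chunks(iterable, blocks_count, block_size):
    # Yield each block as one line of text: skip the block's header line,
    # then join its block_size data rows with spaces.
    for _ in range(blocks_count):
        rows = islice(iterable, 1, block_size + 1)
        yield " ".join(row.strip() for row in rows)

with open(r'c:\tmp\tmp.txt', "r") as f:
    file_iterator = iter(f)
    next(file_iterator)  # skip first comment line
    blocks, _, n = map(int, next(file_iterator).split())
    next(file_iterator)  # skip second comment line
    blocks_iterator = chunks(file_iterator, blocks, n)
    # each row returned by loadtxt holds one flattened block
    flat = np.loadtxt(blocks_iterator, dtype=float, ndmin=2)
    # reshape each flattened block back to n rows
    data = {i: arr.reshape(n, -1) for i, arr in enumerate(flat)}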