Question

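I have a file made up of 'blocks' of data, with a header that states how many blocks the file contains and how many rows are in each block.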

# mydata.dat
3 12343 2
# comment
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8

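I want to store each block separately. What I do is build a list from each block by splitting its lines, then convert that list to an array and store it in a dictionary.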

import numpy as np

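with open('mydata.dat', "r") as f:
    lines = f.readlines()
    # header line: number of blocks, (unused), rows per block
    blocks, _, n = map(int, lines[1].split())
    # drop the two comment lines and the header line
    del lines[:3]
# each block is one size line followed by n data rows, i.e. n + 1 lines
data = {i: np.array([line.split()
                     for line in lines[i * (n + 1) + 1: (i + 1) * (n + 1)]],
                    dtype=float)
        for i in range(blocks)}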

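I feel there should be a better (more efficient) way to parse the data directly from the text blocks, something like np.loadtxt where you could skip lines periodically (like a slice) rather than only at the start of the file.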

Answer 1

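loadtxt does what you are already doing for each block: it reads each line, splits it, converts the fields according to dtype, appends the result to a list, and finally converts that list to an array.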
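
As a rough illustration of that description, here is a simplified model of those steps. It is a sketch only, not NumPy's actual implementation, and `simple_loadtxt` is a hypothetical name introduced for this example:

```python
import numpy as np

def simple_loadtxt(lines, dtype=float):
    """Simplified model of the steps described above (hypothetical helper)."""
    rows = []
    for line in lines:
        fields = line.split()                     # split the line on whitespace
        if not fields:                            # ignore blank lines
            continue
        rows.append([dtype(x) for x in fields])   # convert each field
    return np.array(rows, dtype=dtype)            # one final conversion to an array

arr = simple_loadtxt(["1 2 3 4\n", "5 6 7 8\n"])
```

The real function also handles comments, delimiters, converters and more, which is part of why it tends to carry more overhead per call.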

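As has been pointed out in other questions, you can pass loadtxt an already-open file or any iterable of lines. So you can preprocess the file yourself: divide it into blocks, skip lines, and so on. But overall it won't be more efficient than what you are already doing.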

So this might work (I haven't tested it):

data = {i: np.loadtxt(lines[i * (n + 1) + 1: (i + 1) * (n + 1)], dtype=float)
        for i in range(blocks)}

It's more compact, but I doubt it's any faster.

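The only other approach I can think of is to strip out all the block-size lines, pass the remaining lines to loadtxt to get a single array of all the data, and then split that array into blocks, for example with np.split(...).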
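
A minimal sketch of that idea, assuming the layout from the question (comment lines start with `#`, and every block is one size line followed by n data rows). The name `load_blocks` and the inline `sample` list are made up for this example:

```python
import numpy as np

def load_blocks(lines):
    """Parse all data rows with one loadtxt call, then cut them into blocks."""
    # drop comment lines; the first remaining line is the file header
    rows = [ln for ln in lines if not ln.startswith("#")]
    blocks, _, n = map(int, rows[0].split())
    body = rows[1:]
    # each block spans n + 1 lines; the first of them is the size line
    data_rows = [ln for k, ln in enumerate(body) if k % (n + 1) != 0]
    table = np.loadtxt(data_rows, dtype=float, ndmin=2)
    # split the (blocks * n, columns) table into equal pieces along axis 0
    return np.split(table, blocks)

sample = ["# mydata.dat\n", "3 12343 2\n", "# comment\n"]
for _ in range(3):
    sample += ["12343\n", "1 2 3 4 5 6 7 8\n", "1 2 3 4 5 6 7 8\n"]

blocks_list = load_blocks(sample)
```

With a real file, `load_blocks(f.readlines())` inside a `with open(...)` block should behave the same way. Whether this is faster than slicing per block would need to be measured.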

With txt as the list of lines from the sample:

In [396]: timeit np.array([line.split() for line in txt[4:6]],dtype=float)
100000 loops, best of 3: 13 µs per loop
In [397]: timeit np.loadtxt(txt[4:6],dtype=float)
10000 loops, best of 3: 71.4 µs per loop
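
To repeat this comparison outside IPython, something like the following sketch with the standard `timeit` module could be used. The `rows` list stands in for `txt[4:6]`, and the helper names are made up for this example; the numbers will differ by machine and NumPy version:

```python
import timeit
import numpy as np

# two data rows, standing in for txt[4:6] in the timings above
rows = ["1 2 3 4 5 6 7 8\n", "1 2 3 4 5 6 7 8\n"]

def with_split():
    # split each line by hand and let np.array convert the strings
    return np.array([line.split() for line in rows], dtype=float)

def with_loadtxt():
    # hand the same lines to loadtxt
    return np.loadtxt(rows, dtype=float)

repeats = 1000
t_split = timeit.timeit(with_split, number=repeats)
t_loadtxt = timeit.timeit(with_loadtxt, number=repeats)
print(f"split + np.array: {t_split / repeats * 1e6:.1f} µs per call")
print(f"np.loadtxt:       {t_loadtxt / repeats * 1e6:.1f} µs per call")
```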

Answer 2

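np.loadtxt() can take an iterable, so you can pass it slices of the list of lines. Once comment lines are removed, the first row of data is at index 2 (index 0 is the file header and index 1 is the first block's size line).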

with open('mydata.dat', "r") as f:
    # load data, skipping comment lines
    line = [s for s in f if not s.startswith('#')]

    # parse first line to find out block size
    _, _, blocksize = map(int, line[0].split())

    # use np.loadtxt() to convert slices of the input
    data = [np.loadtxt(line[i:i+blocksize])
            for i in range(2, len(line), blocksize+1)]

You can skip loading the whole file into a list first by using itertools.islice:

import itertools as it
import numpy as np

with open('mydata.dat', "r") as f:
    # iterator over lines in f with comment lines removed
    lines = (line for line in f if not line.startswith('#'))

    # parse block structure
    nblks, _, blksz = map(int, next(lines).split())

    # convert "islice"s of the input file to np.arrays
    # start arg to islice is 1 to skip over block header line
    data = [np.loadtxt(it.islice(lines, 1, blksz + 1)) for i in range(nblks)]

Answer 3

np.loadtxt can take any iterable of strings as its input, and it splits each string on whitespace to build the data array.

The following code gives np.loadtxt an iterable that yields full blocks, not lines.

import numpy as np
from itertools import islice

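def chunks(iterable, blocks_count, block_size):
    # yield each block as one whitespace-separated string
    for _ in range(blocks_count):
        next(iterable)  # skip the block's own size line
        yield " ".join(line.strip() for line in islice(iterable, block_size))

with open(r'c:\tmp\tmp.txt', "r") as f:
    file_iterator = iter(f)
    next(file_iterator)  # skip first comment line
    blocks, _, n = map(int, next(file_iterator).split())
    next(file_iterator)  # skip second comment line
    blocks_iterator = chunks(file_iterator, blocks, n)
    # every block arrives as one flat row; ndmin=2 keeps a single block 2-D
    rows = np.loadtxt(blocks_iterator, dtype=float, ndmin=2)
    # reshape each flat row back into n rows
    data = {i: row.reshape(n, -1) for i, row in enumerate(rows)}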

How to read different parts or blocks of a file with numpy
