I have a file made up of "blocks" of data, where the header states how many blocks the file contains and how many rows are in each block.
# mydata.dat
3 12343 2
# comment
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
12343
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
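I want to store each block separately. What I do is build a list from each block by splitting its lines, then convert it to an array and store it in a dictionary.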
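import numpy as np

with open('mydata.dat', "r") as f:
    lines = f.readlines()

# second line of the file: number of blocks, an unused field, rows per block
blocks, _, n = map(int, lines[1].split())
del lines[:3]  # drop the two comment lines and the file header

# each block is one header line followed by n data rows
data = {i: np.array([line.split()
                     for line in lines[i * (n + 1) + 1:(i + 1) * (n + 1)]],
                    dtype=float)
        for i in range(blocks)}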
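I feel there should be a better (more efficient) way to parse the data directly from the blocks of text, for example with np.loadtxt, where you could skip a periodic number of lines (like a slice) rather than only lines at the start of the file.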
Answer 0 (score: 1)
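loadtxt does the same thing you are doing for each block: it reads each line, splits it, converts the fields according to dtype, appends the result to a list, and finally converts that list to an array.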
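As a rough picture of that per-line work, here is a minimal sketch. `simple_loadtxt` is a hypothetical helper written only for illustration; it is not NumPy's actual implementation, which also handles comments, delimiters, converters and more.

```python
import numpy as np

def simple_loadtxt(lines, dtype=float):
    """Hypothetical, simplified stand-in for np.loadtxt (illustration only)."""
    rows = []
    for line in lines:
        fields = line.split()          # split the line on whitespace
        if not fields:                 # skip blank lines
            continue
        rows.append([dtype(field) for field in fields])  # convert each field
    return np.array(rows, dtype=dtype)  # one final conversion to an array

# two data rows, as in one block of the sample file
sample = ["1 2 3 4 5 6 7 8\n", "1 2 3 4 5 6 7 8\n"]
block = simple_loadtxt(sample)
```

Seen this way, a hand-written list comprehension over `line.split()` does essentially the same job with less overhead per call.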
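As other questions have pointed out, you can pass loadtxt an already open file or any iterable of lines. So you can preprocess the file: divide it into blocks, skip lines, and so on. But overall it won't be more efficient than what you are already doing.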
So this might work (I haven't tested it):
data = {i: np.loadtxt(lines[i * (n + 1) + 1:(i + 1) * (n + 1)], dtype=float)
        for i in range(blocks)}
It is more compact, but I doubt it is any faster.
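The only other approach I can think of is to strip out all the block-size lines, pass the remaining lines to loadtxt to get a single array of all the data, and then split that into blocks, e.g. with np.split(...).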
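That idea could be sketched as follows. This is only an illustration under stated assumptions: the file contents are rebuilt in memory with io.StringIO as a stand-in for mydata.dat, every block is assumed to have exactly n data rows after one header line, and the variable names (`sample`, `body`, `rows`, `all_data`) are my own.

```python
import io
import numpy as np

# Stand-in for mydata.dat from the question, kept in memory so the sketch is self-contained.
sample = io.StringIO(
    "# mydata.dat\n"
    "3 12343 2\n"
    "# comment\n"
    + "12343\n1 2 3 4 5 6 7 8\n1 2 3 4 5 6 7 8\n" * 3
)

# Drop comment lines, then read the file header.
lines = [s for s in sample if not s.startswith("#")]
blocks, _, n = map(int, lines[0].split())
body = lines[1:]

# Keep only data rows: in each group of n + 1 lines, position 0 is the block header.
rows = [s for k, s in enumerate(body) if k % (n + 1) != 0]

all_data = np.loadtxt(rows, dtype=float)            # shape (blocks * n, columns)
data = dict(enumerate(np.split(all_data, blocks)))  # equal pieces along axis 0
```

This calls loadtxt only once, so any fixed per-call overhead is paid a single time rather than once per block.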
With txt as the list of lines from the sample:
In [396]: timeit np.array([line.split() for line in txt[4:6]],dtype=float)
100000 loops, best of 3: 13 µs per loop
In [397]: timeit np.loadtxt(txt[4:6],dtype=float)
10000 loops, best of 3: 71.4 µs per loop
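To reproduce this kind of comparison outside IPython, something like the following sketch could be used. Here `txt` is rebuilt from the sample in the question, and the helper names `with_split` and `with_loadtxt` are my own. The absolute numbers depend on the machine and the NumPy version, so no expected timings are shown.

```python
import timeit
import numpy as np

# The sample file as a list of lines; txt[4:6] are the first block's two data rows.
txt = (["# mydata.dat\n", "3 12343 2\n", "# comment\n"]
       + ["12343\n", "1 2 3 4 5 6 7 8\n", "1 2 3 4 5 6 7 8\n"] * 3)

def with_split():
    return np.array([line.split() for line in txt[4:6]], dtype=float)

def with_loadtxt():
    return np.loadtxt(txt[4:6], dtype=float)

# Both approaches should produce the same 2 x 8 block.
same = np.array_equal(with_split(), with_loadtxt())

t_split = timeit.timeit(with_split, number=1000)
t_loadtxt = timeit.timeit(with_loadtxt, number=1000)
print(f"split: {t_split:.4f} s, loadtxt: {t_loadtxt:.4f} s for 1000 calls")
```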
Answer 1 (score: 1)
np.loadtxt() can take an iterable, so you can pass it slices of the line list. Once the comment lines are dropped, the first row of data is at index 2.
import numpy as np

with open('mydata.dat', "r") as f:
    # load data, skipping comment lines
    line = [s for s in f if not s.startswith('#')]

# parse first line to find out block size
_, _, blocksize = map(int, line[0].split())

# use np.loadtxt() to convert slices of the input
data = [np.loadtxt(line[i:i+blocksize])
        for i in range(2, len(line), blocksize+1)]
You can skip loading the whole file into a list first by using itertools.islice:
import itertools as it
import numpy as np

with open('mydata.dat', "r") as f:
    # iterator over lines in f with comment lines removed
    lines = (line for line in f if not line.startswith('#'))

    # parse block structure
    nblks, _, blksz = map(int, next(lines).split())

    # convert "islice"s of the input file to np.arrays
    # start arg to islice is 1 to skip over block header line
    data = [np.loadtxt(it.islice(lines, 1, blksz + 1)) for i in range(nblks)]
Answer 2 (score: 0)
np.loadtxt can take any iterable as its argument, and it splits each item on whitespace to build the data array.
The following code gives np.loadtxt an iterable that yields full blocks, not lines.
import numpy as np
from itertools import islice

def chunks(iterable, blocks_count, block_size):
    for _ in range(blocks_count):
        next(iterable)  # skip the block header line
        # join the block's rows into one whitespace-separated line
        yield " ".join(line.strip() for line in islice(iterable, block_size))

with open(r'c:\tmp\tmp.txt', "r") as f:
    file_iterator = iter(f)
    next(file_iterator)  # skip first comment line
    blocks, _, n = map(int, next(file_iterator).split())
    next(file_iterator)  # skip second comment line
    blocks_iterator = chunks(file_iterator, blocks, n)
    # each block arrives as one flattened row; reshape it back to n rows
    data = {i: row.reshape(n, -1)
            for i, row in enumerate(np.loadtxt(blocks_iterator, dtype=float, ndmin=2))}