Question

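I have a file from which I only need to read certain values into an array. The file is divided by lines that specify a TIMESTEP value. I need the section of data that follows the highest TIMESTEP in the file.

The files will contain more than 200,000 lines, and for any given file I do not know which line I need to start from, nor what the highest TIMESTEP value will be.

I assume that if I could find the line number of the highest TIMESTEP, I could import starting from that line. All of the TIMESTEP lines begin with a space character. Any ideas on how to proceed would be helpful.

Sample file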

 headerline 1 to skip
 headerline 2 to skip
 headerline 3 to skip
 TIMESTEP =    0.00000000    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
 TIMESTEP =   0.119999997    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
3,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
 TIMESTEP =    3.00000000    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0

Basic code

import numpy as np

with open('myfile.txt') as f_in:
  data = np.genfromtxt(f_in, skip_header=3, comments=" ")

Answer 1

You can apply filter() on the fly while calling genfromtxt(), because genfromtxt accepts generators.

import numpy as np

with open('myfile.txt', 'rb') as f_in:
    # Drop every line that starts with a space (headers and TIMESTEP markers).
    lines = filter(lambda x: not x.startswith(b' '), f_in)
    data = np.genfromtxt(lines, delimiter=',')

In your case you then do not need skip_header, because the header lines also start with a space and are filtered out.
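Note that the filter above drops the TIMESTEP lines too, so all blocks end up stacked in a single array. If only the block after the highest TIMESTEP is wanted, a two-pass variant along the lines the question suggests might look like the sketch below. The helper names `find_max_timestep_line` and `read_block_after` and the miniature `SAMPLE` text are made up for illustration. The sketch assumes header and TIMESTEP lines start with a space and data lines do not.

```python
import io
import itertools

import numpy as np

# Made-up miniature of the sample file (an assumption, for illustration only).
SAMPLE = (
    " headerline 1 to skip\n"
    " TIMESTEP =    0.00000000\n"
    "0,    1.0,   1.0\n"
    "1,    1.0,   1.0\n"
    " TIMESTEP =    3.00000000\n"
    "0,    2.0,   2.0\n"
    "1,    2.0,   2.0\n"
    " TIMESTEP =   0.119999997\n"
    "0,    3.0,   3.0\n"
    "1,    3.0,   3.0\n"
)


def find_max_timestep_line(lines):
    """First pass: return (line_index, value) of the largest TIMESTEP."""
    best_index, best_value = None, None
    for index, line in enumerate(lines):
        if "TIMESTEP" in line:
            value = float(line.split("=")[1])
            if best_value is None or value > best_value:
                best_index, best_value = index, value
    return best_index, best_value


def read_block_after(lines, marker_index):
    """Second pass: parse the rows after the marker, up to the next space-led line."""
    after_marker = itertools.islice(lines, marker_index + 1, None)
    rows = itertools.takewhile(lambda line: not line.startswith(" "), after_marker)
    return np.genfromtxt(rows, delimiter=",")


f_in = io.StringIO(SAMPLE)  # stands in for open('myfile.txt')
marker_index, max_timestep = find_max_timestep_line(f_in)
f_in.seek(0)  # rewind before the second pass
last_block = read_block_after(f_in, marker_index)
print(max_timestep)
print(last_block)
```

The file is read twice, but only the rows of one block are ever handed to genfromtxt, which should keep memory use modest for files with 200,000+ lines.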

Answer 2

You can use a custom iterator.

Here is a working example:

from numpy import genfromtxt

class Iter(object):
    ' a custom iterator which returns a timestep and corresponding data '

    def __init__(self, fd):
        self.__fd = fd
        self.__timestep = None
        self.__next_timestep = None
        self.__finish = False
        for _ in self.to_next_timestep(): pass # skip header

    def to_next_timestep(self):
        ' iterate until next timestep '
        for line in self.__fd:
            if 'TIMESTEP' in line:
                self.__timestep = self.__next_timestep
                self.__next_timestep = float(line.split('=')[1])
                return
            yield line
        self.__timestep = self.__next_timestep
        self.__finish = True

    def __iter__(self): return self

    def __next__(self):  # Python 3 iterator protocol (was next() in Python 2)
        if self.__finish:
            raise StopIteration
        data = genfromtxt(self.to_next_timestep(), delimiter=',')
        return self.__timestep, data

with open('myfile.txt') as fd:
    blocks = Iter(fd)
    for timestep, data in blocks:
        print(timestep, data) # data can be selected upon highest timestep
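To actually pick the block with the highest timestep, one option is to pass the (timestep, data) pairs to max() with a key on the timestep. The sketch below uses a hand-written list `pairs` with made-up arrays as a stand-in so that it runs on its own; an `Iter(fd)` instance yields pairs of the same shape and could presumably be passed in its place.

```python
from operator import itemgetter

import numpy as np

# Stand-in for the (timestep, data) pairs the iterator yields.
# The values are made up for illustration.
pairs = [
    (0.0, np.array([[0.0, 1.0], [1.0, 1.0]])),
    (0.119999997, np.array([[0.0, 2.0], [1.0, 2.0]])),
    (3.0, np.array([[0.0, 3.0], [1.0, 3.0]])),
]

# Compare on the timestep only. Without the key, a tie on the timestep
# would make Python compare the arrays themselves, which raises an error.
timestep, data = max(pairs, key=itemgetter(0))
print(timestep)
print(data)
```

This still parses every block with genfromtxt before discarding all but one, so it trades some speed for simplicity.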

Answer 3

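Here is a solution that uses regular Python file reading and applies genfromtxt to a list of lines. For illustration I am parsing every data block, but it can easily be modified to skip blocks that do not meet your timestep criterion.

I first wrote this using StringIO, as many of the genfromtxt doc examples do, but all it needs is an iterable, so a list of lines works fine.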
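As a quick check of that point, the sketch below feeds the same two made-up rows to genfromtxt once as a list of strings and once through io.StringIO. The names `rows`, `from_list` and `from_stringio` are only for illustration.

```python
import io

import numpy as np

# Two made-up data rows in the same layout as the sample file.
rows = ["0,    1.0,   1.0\n", "1,    1.0,   1.0\n"]

from_list = np.genfromtxt(rows, delimiter=",")
from_stringio = np.genfromtxt(io.StringIO("".join(rows)), delimiter=",")

print(np.array_equal(from_list, from_stringio))
```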

import numpy as np
filename = 'stack26008436.txt'

def parse(tstep, block):
    # block is a list of raw text lines; genfromtxt accepts any iterable of lines
    print(tstep)
    print(np.genfromtxt(block, delimiter=','))

with open(filename) as f:
    block = []
    for line in f:
        if 'TIMESTEP' in line:
            if block:
                parse(tstep, block)
            block = []
            tstep = float(line.strip().split('=')[1])
        else:
            if 'header' not in line:
                block.append(line)
    parse(tstep, block)

Run on your sample (the output below is from a Python 2.7 run), this produces:

0901:~/mypy$ python2.7 stack26008436.py
0.0
[[ 0.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 ...
 [ 3.  1.  1.  1.  1.  1.  1.]]
3.0
[[ 0.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 2.  1.  1.  1.  1.  1.  1.]]
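As one possible version of the modification mentioned above, the sketch below keeps only the raw lines of the block with the highest timestep seen so far and calls genfromtxt once at the end. The function name `read_highest_timestep` and the miniature `SAMPLE` text are made up for illustration, and the sketch assumes every line before the first TIMESTEP marker is a header to be ignored.

```python
import io

import numpy as np

# Made-up miniature of the sample file (an assumption, for illustration only).
SAMPLE = (
    " headerline 1 to skip\n"
    " TIMESTEP =    0.00000000\n"
    "0,    1.0,   1.0\n"
    "1,    1.0,   1.0\n"
    " TIMESTEP =    3.00000000\n"
    "0,    2.0,   2.0\n"
    "1,    2.0,   2.0\n"
    " TIMESTEP =   0.119999997\n"
    "0,    3.0,   3.0\n"
    "1,    3.0,   3.0\n"
)


def read_highest_timestep(lines):
    """Single pass: return (timestep, array) for the highest TIMESTEP block."""
    best_tstep, best_block = None, []
    keep = False  # True while the current block belongs to the best timestep
    for line in lines:
        if "TIMESTEP" in line:
            tstep = float(line.strip().split("=")[1])
            keep = best_tstep is None or tstep > best_tstep
            if keep:
                best_tstep, best_block = tstep, []
        elif keep:
            best_block.append(line)
    if best_tstep is None:
        raise ValueError("no TIMESTEP line found")
    return best_tstep, np.genfromtxt(best_block, delimiter=",")


tstep, data = read_highest_timestep(io.StringIO(SAMPLE))
print(tstep)
print(data)
```

Only one block of lines is held in memory at a time, and blocks with a lower timestep are skipped without being parsed.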

Filtering with numpy.genfromtxt
