Question

我的数据文件包含多个时间步长的数据，每个时间步长格式化为一个块，如下所示：

TIMESTEP  PARTICLES
0.00500103 1262
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....

每个块由3个标题行和与时间步长相关的多行数据组成（第2行的int）。与块关联的数据行数可以在0到1000万之间变化。每个块之间可能有一个空行，但有时会丢失。

我希望能够逐块读取文件，在读取块后处理数据 - 文件很大（通常超过200GB），并且一次性步骤可以很好地加载到内存中。

由于文件格式，我认为编写一个读取3个标题行的函数非常容易，读取实际数据然后返回一个漂亮的numpy数组进行数据处理。我已经习惯了 MATLAB ，你可以简单地读取块而不是文件的末尾。我不太确定如何用python做到这一点。

我创建了以下函数来读取数据块：

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0

    line = f.readline().strip()
    if line.startswith('TIMESTEP'): 

        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.strip().split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1

        mydt = np.dtype([ ('ID',int), 
                     ('GROUP', int),
                     ('Vol', float),
                     ('Mass', float),
                     ('Px', float),
                     ('Py', float),
                     ('Pz', float),
                     ('Vx', float),
                     ('Vy', float),
                     ('Vz', float),
                     ] )

        particleData = np.array(particleData, dtype=mydt)

    return Timestep, numParticles, particleData

我尝试运行这样的函数：

with open(fileOpenPath, 'r') as file:
    startWallTime = time.clock()

    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)

    ## Do processing stuff here 
    print("Timestep Processed")

    endWallTime = time.clock()

问题是这只会从文件中读取第一个数据块并停在那里 - 我不知道如何让它循环遍历文件直到它到达终点并停止。

关于如何完成这项工作的任何建议都会很棒。我想我可以用单行处理来编写一种方法，如果检查是否在时间步长结束时进行检查，但简单的功能似乎更容易和更清晰。

Answer 1

您可以使用numpy.genfromtxt的max_rows参数：

with open("timesteps.dat", "rb") as f:
    while True:
        line = f.readline()
        if len(line) == 0:
            # End of file
            break
        # Skip blank lines
        while len(line.strip()) == 0:
            line = f.readline()
        line2_fields = f.readline().split()
        timestep = float(line2_fields[0])
        particles = int(line2_fields[1])
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)

        print("Timestep:", timestep)
        print("Particles:", particles)
        print("Data:")
        print(data)
        print()

这是一个示例文件：

TIMESTEP  PARTICLES
0.00500103    4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103    5
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

TIMESTEP  PARTICLES
0.00500103    3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

这是输出：

Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
 (652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Answer 2

with不会循环，只会确保文件在之后正确关闭。

要循环，您需要在with语句之后添加一段时间（请参阅下面的代码）。但在此之前，您需要检查readBlock（f）函数以查找文件结尾（EOF）。使用以下代码替换line = f.readline().strip()：

line = f.readline()
if not line:
    # EOF: returning None's.
    return None, None, None
# We do the strip after the check.
# Otherwise a blank line "\n" might be interpreted as EOF.
line = line.strip()

因此在with块中添加while循环并检查我们是否返回None表示EOF，因此我们可以突破while循环：

with open('file1') as file_handle:
    while True:
        startWallTime = time.clock()

        Timestep, numParticles, particleData = readBlock(file_handle)
        if Timestep == None:
            break
        print(Timestep)

        ## Do processing stuff here 
        print("Timestep Processed")

        endWallTime = time.clock()

Answer 3

这里是一个快速肮脏的测试（第二次试用！）

import numpy as np

with open('stack41091659.txt','rb') as f:
    while f.readline():    # read the 'TIMESTEP  PARTICLES' line
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        ablock = [f.readline()]  # block header line
        for i in range(n):
            ablock.append(f.readline())
        print(len(ablock))
        data = np.genfromtxt(ablock, dtype=None, names=True)
        print(data.shape, data.dtype)

试运行：

1458:~/mypy$ python3 stack41091659.py 
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 3
4
(3,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 2
3
(2,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]

示例文件：

TIMESTEP  PARTICLES
0.00500103 4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 2
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

我使用的事实是genfromtxt对任何能够为它提供线条的东西感到满意。在这里，我收集列表中的下一个块，并将其传递给genfromtxt。

使用max_rows的{{1}}参数，我可以告诉它直接阅读下一个genfromtxt行：

我没有考虑可选的空白行。可能会在块读取开始时挤压它。即读取行，直到我得到一个有效的with open('stack41091659.txt','rb') as f: while f.readline(): time, n = f.readline().strip().split() n = int(n) print(time, n) data = np.genfromtxt(f, dtype=None, names=True, max_rows=n) print(data.shape, len(data.dtype.names))字符串对。

Python函数在打开时从文件中读取可变长度的数据块

3 个答案: