My data file contains data for multiple timesteps, with each timestep formatted as a block like this:
TIMESTEP PARTICLES
0.00500103 1262
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....
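Each block consists of 3 header lines followed by a number of data lines belonging to that timestep (the count is the int on line 2). The number of data lines in a block can vary from 0 to 10 million. There may be a blank line between blocks, but it is sometimes missing.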
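As a rough illustration of the layout just described, the three header lines of one block could be parsed as below. The helper name parse_block_header and its return value are hypothetical, chosen only for this sketch; it assumes the three lines have already been read and that fields are separated by whitespace.

```python
def parse_block_header(title_line, values_line, columns_line):
    """Parse the three header lines of one block (hypothetical helper).

    Returns (timestep, num_particles, column_names).
    """
    # Line 1 is expected to be the fixed title "TIMESTEP PARTICLES".
    if title_line.split() != ["TIMESTEP", "PARTICLES"]:
        raise ValueError(f"unexpected block title: {title_line!r}")
    # Line 2 holds the timestep value and the number of data rows.
    timestep_text, count_text = values_line.split()
    # Line 3 names the data columns.
    column_names = columns_line.split()
    return float(timestep_text), int(count_text), column_names


header = parse_block_header(
    "TIMESTEP PARTICLES\n",
    "0.00500103 1262\n",
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n",
)
print(header)
```

The particle count returned here is what tells a reader how many data lines to consume before the next block starts.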
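I want to be able to read the file block by block, processing the data after each block is read. The files are large (often over 200 GB), but a single timestep fits comfortably in memory.
Because of the file format, I thought it would be easy to write a function that reads the 3 header lines, reads the actual data, and then returns a nice numpy array for processing. I'm used to MATLAB, where you can simply keep reading blocks while not at the end of the file. I'm not sure how to do this in Python.
I created the following function to read a block of data: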
import numpy as np

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0
    line = f.readline().strip()
    if line.startswith('TIMESTEP'):
        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.strip().split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        # Read exactly numParticles data lines for this block
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1
        mydt = np.dtype([ ('ID', int),
                          ('GROUP', int),
                          ('Vol', float),
                          ('Mass', float),
                          ('Px', float),
                          ('Py', float),
                          ('Pz', float),
                          ('Vx', float),
                          ('Vy', float),
                          ('Vz', float),
                        ] )
        particleData = np.array(particleData, dtype=mydt)
    return Timestep, numParticles, particleData
I tried running the function like this:
import time

with open(fileOpenPath, 'r') as file:
    # time.clock() was removed in Python 3.8; perf_counter() is the replacement
    startWallTime = time.perf_counter()
    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)
    ## Do processing stuff here
    print("Timestep Processed")
    endWallTime = time.perf_counter()
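The problem is that this only reads the first block of data from the file and stops there. I don't know how to make it loop through the file until it reaches the end and then stop.
Any suggestions on how to make this work would be great. I think I could write a line-by-line approach with a bunch of if checks to detect the end of a timestep, but a simple function seems easier and cleaner.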
Answer 0 (score: 2)
You can use the max_rows parameter of numpy.genfromtxt:
import numpy as np

with open("timesteps.dat", "rb") as f:
    while True:
        line = f.readline()
        # Skip blank lines; stop skipping once the file is exhausted
        while line and not line.strip():
            line = f.readline()
        if not line:
            # End of file
            break
        line2_fields = f.readline().split()
        timestep = float(line2_fields[0])
        particles = int(line2_fields[1])
        # Reads the column-name line plus the next `particles` data rows
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)
        print("Timestep:", timestep)
        print("Particles:", particles)
        print("Data:")
        print(data)
        print()
Here is a sample file:
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 5
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 3
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
Here is the output:
Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
(430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
(652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
(431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
(333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
(444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Answer 1 (score: 0)
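with does not loop; it only ensures that the file is properly closed afterwards.
To loop, you need to add a while loop inside the with statement (see the code below). Before that, though, the readBlock(f) function has to check for end of file (EOF). Replace line = f.readline().strip() with the following code: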
line = f.readline()
if not line:
    # EOF: returning None's.
    return None, None, None
# We do the strip after the check.
# Otherwise a blank line "\n" might be interpreted as EOF.
line = line.strip()
Then add a while loop inside the with block and check whether None was returned, which indicates EOF, so that we can break out of the while loop:
import time

with open('file1') as file_handle:
    while True:
        # time.clock() was removed in Python 3.8; perf_counter() is the replacement
        startWallTime = time.perf_counter()
        Timestep, numParticles, particleData = readBlock(file_handle)
        if Timestep is None:
            break
        print(Timestep)
        ## Do processing stuff here
        print("Timestep Processed")
        endWallTime = time.perf_counter()
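The same stop-at-EOF idea can also be written without an explicit break by using the two-argument form of the built-in iter, which calls a function repeatedly until it returns a sentinel value. The sketch below is only an illustration under a few assumptions: this read_block is a simplified stand-in (not the readBlock from the question) that returns a single None at EOF instead of three, keeps the rows as lists of strings, and reads from an in-memory sample rather than a real file.

```python
import io

SAMPLE = (
    "TIMESTEP PARTICLES\n"
    "0.00500103 2\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "\n"
    "TIMESTEP PARTICLES\n"
    "0.00500103 1\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)


def read_block(f):
    """Return (timestep, rows) for the next block, or None at end of file."""
    line = f.readline()
    while line and not line.strip():  # skip optional blank lines
        line = f.readline()
    if not line:                      # '' means end of file
        return None
    timestep_text, count_text = f.readline().split()
    f.readline()                      # column-name line, not needed here
    rows = [f.readline().split() for _ in range(int(count_text))]
    return float(timestep_text), rows


stream = io.StringIO(SAMPLE)
# iter(callable, sentinel) keeps calling read_block until it returns None.
for timestep, rows in iter(lambda: read_block(stream), None):
    print(timestep, len(rows))
```

Whether this reads better than while True with a break is a matter of taste; both stop at the first call that reports EOF.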
Answer 2 (score: 0)
Here is a quick and dirty test (second try!):
import numpy as np

with open('stack41091659.txt','rb') as f:
    while f.readline():   # read the 'TIMESTEP PARTICLES' line
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        ablock = [f.readline()]  # block header line
        for i in range(n):
            ablock.append(f.readline())
        print(len(ablock))
        data = np.genfromtxt(ablock, dtype=None, names=True)
        print(data.shape, data.dtype)
Test run:
1458:~/mypy$ python3 stack41091659.py
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 3
4
(3,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 2
3
(2,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
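The line-collecting step in the code above can be written more compactly with itertools.islice, which takes exactly the requested number of lines from the open file and leaves the file positioned at the start of the next block. This is a minimal sketch, not the answer's own code: the function name collect_block_lines is hypothetical, it reads text from an in-memory sample, and like the code above it does not handle blank lines between blocks.

```python
import io
from itertools import islice

SAMPLE = (
    "TIMESTEP PARTICLES\n"
    "0.00500103 2\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "TIMESTEP PARTICLES\n"
    "0.00500103 1\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)


def collect_block_lines(f):
    """Return (timestep_text, lines) for the next block, or None at EOF.

    `lines` holds the column-name line followed by the n data lines.
    """
    if not f.readline():              # the 'TIMESTEP PARTICLES' line
        return None
    timestep_text, count_text = f.readline().split()
    # One column-name line plus n data lines, and nothing beyond that.
    lines = list(islice(f, int(count_text) + 1))
    return timestep_text, lines


stream = io.StringIO(SAMPLE)
while True:
    block = collect_block_lines(stream)
    if block is None:
        break
    timestep_text, lines = block
    print(timestep_text, len(lines))
```

The resulting list of lines is the kind of object that is passed to np.genfromtxt with names=True in the code above.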
Sample file:
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 3
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 2
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP PARTICLES
0.00500103 4
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
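I'm using the fact that genfromtxt is happy with anything that feeds it lines. Here I collect the next block in a list and pass that to genfromtxt.
Using the max_rows parameter of genfromtxt, I can tell it to read the next n lines directly:
with open('stack41091659.txt','rb') as f:
    while f.readline():
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        data = np.genfromtxt(f, dtype=None, names=True, max_rows=n)
        print(data.shape, len(data.dtype.names))
I'm not accounting for the optional blank line. It could probably be squeezed in at the start of the block read, i.e. read lines until I get a valid pair of strings.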
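One way the blank-line handling could be squeezed in is a small helper that keeps reading until it finds a line with content, while still letting the caller tell a blank line ("\n") apart from end of file (an empty string). The helper name next_nonblank_line is hypothetical, and the sketch reads text from an in-memory sample rather than the file used above.

```python
import io


def next_nonblank_line(f):
    """Return the next line with non-whitespace content, or '' at end of file."""
    while True:
        line = f.readline()
        # '' signals EOF; any line with visible content is what we want.
        if not line or line.strip():
            return line


stream = io.StringIO("\n\nTIMESTEP PARTICLES\n0.00500103 4\n\n")
title = next_nonblank_line(stream)       # skips the two leading blank lines
values = stream.readline().split()       # the timestep / particle-count pair
at_end = next_nonblank_line(stream)      # trailing blank line, then EOF
print(repr(title), values, repr(at_end))
```

In the loops above, the call that reads the 'TIMESTEP PARTICLES' line could be swapped for such a helper, so that a trailing blank line at the end of the file ends the loop instead of being mistaken for a block title.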