I have a text file that contains blocks in the following format:
...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
---this repeats in txt file----
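There are many such blocks, and they appear at different places in the text file. I only want to extract the values that appear between "MY TEST MATRIX (ROWS)" and "MY TEST END", and between "MY TEST END" and "STOP", into separate arrays; call them firstvalue[] and secondvalue[].
For me, one block is "MY TEST MATRIX (ROWS)" through "MY TEST END".
With the simple code shown below I can read one block of data from the text file. However, because the blocks repeat in my file, I don't know how to capture the data from each block in the two arrays above.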
import os
import sys
from math import *

firstValue = []
secondValue = []
checkFirst = False
checkSecond = False
filename = "r3dmdtr2.txt"

with open(filename, "r") as infile:
    for line in infile:
        if line.strip().startswith("MY TEST MATRIX (ROWS)"):
            checkFirst = True
        if line.strip().startswith("MY TEST END"):
            checkFirst = False
            checkSecond = True
        if line.strip().startswith("STOP"):
            checkSecond = False
        if checkFirst:
            firstValue.append(line)
        if checkSecond:
            secondValue.append(line)

print(firstValue)
print(secondValue)
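The snippet above reads a single block of data perfectly. How can I parse all the repeated blocks in the text file and append each of them as a separate array to firstValue[], something like this:
firstvalue = [[values from the first block], [values from the second block], and so on...]
secondvalue = [[values from the first block], [values from the second block], and so on...]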
Answer 0 (score: 1)
You can use re.findall:
>>> import re
>>> data = open('file.txt').read()
>>> blocks = re.findall(r'MY TEST MATRIX \(ROWS\)\s*(.*?)\s+MY TEST END\s*(.*?)\s+STOP', data, re.DOTALL)
>>> first, second = zip(*blocks)
>>> print (first)
('2X+00 2X+00 1X+00 \n 2X+00 2X+00 1K+00', '2P+00 2X+00 1M+00 \n 2X+00 2Z+00 1K+00')
>>> print (second)
('2Y+00 2Y+00 1E+00 \n 2Y+00 2Z+00 1E+00', '2Y+00 2Y+00 1E+00 \n 2Y+00 2Z+00 1E+00')
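To get the nested lists the question asks for, each captured chunk can be split into rows and converted to floats. The following is only a sketch built on the same regular expression; the helper names to_rows and parse_blocks and the inline SAMPLE text are assumptions made for illustration, and in practice you would pass in the contents of your own file.

```python
import re

# Hypothetical sample text that mirrors the layout shown in the question.
SAMPLE = """\
...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
"""

# Same pattern as above: group 1 is the text between the first two markers,
# group 2 is the text between "MY TEST END" and "STOP".
BLOCK_RE = re.compile(
    r'MY TEST MATRIX \(ROWS\)\s*(.*?)\s+MY TEST END\s*(.*?)\s+STOP',
    re.DOTALL,
)

def to_rows(text):
    """Turn one captured chunk into a list of rows, each a list of floats."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]

def parse_blocks(data):
    """Return (first_value, second_value) with one sub-list per block."""
    first_value, second_value = [], []
    for first, second in BLOCK_RE.findall(data):
        first_value.append(to_rows(first))
        second_value.append(to_rows(second))
    return first_value, second_value

first_value, second_value = parse_blocks(SAMPLE)
print(len(first_value), "blocks found")
print(first_value[1])
print(second_value[0])
```

Because the whole file is held in memory as one string, this suits files of moderate size; float() accepts the E-notation used in the data directly.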
Answer 1 (score: 0)
Given:
$ cat file.txt
...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
---this repeats in txt file----
In sed, perl, or awk you have the concept of a range regular expression, which lets you do this:
$ sed -nE '/^MY TEST MATRIX/,/^MY TEST END/p' file.txt
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
You can replicate this behavior in Python with a FlipFlop class:
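class FlipFlop:
    ''' Class to imitate the behavior of the /start/, /end/ flip flop in awk '''
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        ms = [e.search(st) for e in self.patterns]
        # Start and end match on the same line: emit it and stay outside the range
        if all(m for m in ms):
            self.state = False
            return True
        rtr = True if self.state else False
        # self.state indexes the pattern to watch: 0 (start) outside, 1 (end) inside
        if ms[self.state]:
            self.state = not self.state
        # True inside the range, and also for the line that closes it
        return self.state or rtr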
Then capture the blocks while reading the file line by line:
import re

fn = 'file.txt'  # the sample file shown above
di = {}
blocks = [FlipFlop(re.compile(r'^MY TEST MATRIX \(ROWS\)'), re.compile(r'^MY TEST END')),
          FlipFlop(re.compile(r'^MY TEST END'), re.compile(r'^STOP'))]
for i, ff in enumerate(blocks):
    with open(fn) as f:
        di[i] = [line.strip() for line in f if ff(line)]
Result:
>>> di
{0: ['MY TEST MATRIX (ROWS)',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'MY TEST END',
'MY TEST MATRIX (ROWS)',
'2E+04 2E+04 0.8687E-03',
'2E+04 2E+04 0.8687E-03',
'MY TEST END'],
1: ['MY TEST END',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'0.3776E+03 0.8687E-03 0.1975E-04',
'STOP',
'MY TEST END',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'0.5056E+03 0.8687E-03 -0.1202E-02',
'STOP']}
This reads the file twice in order to save memory; if speed matters more, you can read the file into memory and iterate over it.
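If a single pass and one sub-list per block are preferred, a small state machine is another option. This is a sketch rather than part of the approach above: the function name collect_blocks and the inline SAMPLE text are assumptions for illustration, and the marker lines themselves are dropped from the output.

```python
# Hypothetical sample text that mirrors the layout of file.txt.
SAMPLE = """\
...some lines before this...
MY TEST MATRIX (ROWS)
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.3776E+03 0.8687E-03 0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
2E+04 2E+04 0.8687E-03
2E+04 2E+04 0.8687E-03
MY TEST END
0.5056E+03 0.8687E-03 -0.1202E-02
0.5056E+03 0.8687E-03 -0.1202E-02
STOP
---some lines after this
"""

def collect_blocks(lines):
    """Single pass over lines; return (first_value, second_value),
    each holding one sub-list of stripped data lines per block."""
    first_value, second_value = [], []
    state = None  # None: outside a block; "first" / "second": inside a section
    for raw in lines:
        line = raw.strip()
        if line.startswith("MY TEST MATRIX (ROWS)"):
            first_value.append([])    # open a new sub-list for this block
            state = "first"
        elif state == "first" and line.startswith("MY TEST END"):
            second_value.append([])   # the second section starts here
            state = "second"
        elif state == "second" and line.startswith("STOP"):
            state = None
        elif state == "first" and line:
            first_value[-1].append(line)
        elif state == "second" and line:
            second_value[-1].append(line)
    return first_value, second_value

first_value, second_value = collect_blocks(SAMPLE.splitlines())
print(first_value)
print(second_value)
```

With a real file, an open file object can be passed directly, for example collect_blocks(f) inside a with open(fn) as f: block, since the function only iterates over lines.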