I have an input file that looks like this:
PATTERN1 PTR1 blah blah blah
needThis blah blah blah
thisOneAsWell blah blah blah
PATTERN2
PATTERN1 PTR2 blah blah blah
needThis blah blah blah
thisOneAsWell blah blah blah
PATTERN2
............................
............................
PATTERN1 PTRN blah blah
needThis blah blah blah
thisOneAsWell blah blah blah
PATTERN2
I need my function to return only the first-column entries from PATTERN1 through PATTERN2, like this:
PTR1
needThis thisOneAsWell
PTR2
needThis thisOneAsWell
......................
......................
PTRN
needThis thisOneAsWell
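PTR1, PTR2, ..., PTRN are different pieces of text. PATTERN1 and PATTERN2 differ from each other but are always present in the file.
How can I achieve this in Python?
I am still a beginner in Python. I tried to do this with re.findall(), but it does not give the desired output: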
import re

def retrieve():
    file = open("fileName", "r")
    string = re.findall(r"PATTERN1", file.read())
    print(string)
Answer 0 (score: 0)
import re

with open('file', 'r') as f:
    content = f.read()

# Non-greedy match of everything between each PATTERN1 and the next PATTERN2.
matches = re.findall(r'PATTERN1(.*?)PATTERN2', content, re.MULTILINE | re.DOTALL)
for match in matches:
    for line in match.split('\n'):
        columns = line.split()
        if columns:  # skip blank lines
            print(columns[0])
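Since the question asks for a function that returns the entries, the same idea can be wrapped up as below. This is only a sketch: the names `retrieve`, `format_blocks`, and `sample` are chosen here for illustration, and it assumes each block's first entry (PTR1, PTR2, ...) should go on its own line with the remaining entries joined by spaces, as in the desired output.

```python
import re


def retrieve(text, start="PATTERN1", end="PATTERN2"):
    """Return one list of first-column entries per start..end block."""
    # re.escape guards against markers that contain regex metacharacters.
    block_re = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    blocks = []
    for body in block_re.findall(text):
        # First whitespace-separated token of every non-blank line in the block.
        blocks.append([line.split()[0] for line in body.splitlines() if line.split()])
    return blocks


def format_blocks(blocks):
    """Put each block's first entry on one line and the rest on the next."""
    lines = []
    for block in blocks:
        if not block:  # nothing between the two markers
            continue
        lines.append(block[0])
        lines.append(" ".join(block[1:]))
    return "\n".join(lines)


# Hypothetical sample input mirroring the layout in the question.
sample = (
    "PATTERN1 PTR1 blah blah blah\n"
    "needThis blah blah blah\n"
    "thisOneAsWell blah blah blah\n"
    "PATTERN2\n"
    "PATTERN1 PTR2 blah blah blah\n"
    "needThis blah blah blah\n"
    "thisOneAsWell blah blah blah\n"
    "PATTERN2\n"
)

print(format_blocks(retrieve(sample)))
```

Returning a list of lists keeps the parsing separate from the formatting, so the caller can decide how to join the entries.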
Answer 1 (score: 0)
You can nest two regular expressions:
txt='''\
PATTERN1 PTR1 blah blah blah
needThis1 blah blah blah
thisOneAsWell1 blah blah blah
PATTERN2
PATTERN1 PTR2 blah blah blah
needThis2 blah blah blah
thisOneAsWell2 blah blah blah
PATTERN2
............................
............................
PATTERN1 PTRN blah blah
needThisN blah blah blah
thisOneAsWellN blah blah blah
PATTERN2'''
import re
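# Outer regex: capture each block from PATTERN1 up to (not including) PATTERN2.
for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
    # Inner regex: the first word of every line in the block.
    print(re.findall(r'(^\w+)', m.group(1), re.M))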
This prints:
['PTR1', 'needThis1', 'thisOneAsWell1']
['PTR2', 'needThis2', 'thisOneAsWell2']
['PTRN', 'needThisN', 'thisOneAsWellN']
Edit 1
If the file you are working with fits easily in memory:
import re

# fn is the path to the input file
with open(fn) as f:
    txt = f.read()

for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
    print(re.findall(r'(^\w+)', m.group(1), re.M))
Use mmap for larger files that do not fit easily in memory.
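A minimal sketch of the mmap variant, assuming Python 3 and an ASCII-compatible file. The function name `first_columns_mmap` is made up for illustration. A memory-mapped file is searched as bytes, so the patterns are bytes patterns and the matched words are decoded afterwards.

```python
import mmap
import re

# Same two regexes as above, written as bytes patterns for use on an mmap.
BLOCK_RE = re.compile(rb'^PATTERN1\s*(.*?)(?=^PATTERN2)', re.M | re.S)
FIRST_WORD_RE = re.compile(rb'^\w+', re.M)


def first_columns_mmap(path):
    """Return one list of first-column words per PATTERN1..PATTERN2 block."""
    results = []
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in BLOCK_RE.finditer(mm):
                words = FIRST_WORD_RE.findall(m.group(1))
                # In a bytes pattern \w matches only ASCII word characters.
                results.append([w.decode('ascii') for w in words])
    return results
```

Calling `first_columns_mmap(fn)` should give the same lists as the in-memory version without reading the whole file into a Python string first.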
Edit 2
To collect the results, join each block's entries into a single string and append that string to a list:
import re

with open(fn) as f:
    txt = f.read()

results = []
for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
    # One string per block, with its first-column entries separated by newlines.
    results.append('\n'.join(re.findall(r'(^\w+)', m.group(1), re.M)))
print('\n===\n'.join(results))