Retrieve text from Pattern1 to Pattern2 - Python

Time: 2014-06-26 16:22:21

Tags: python regex python-2.7

I have an input file that looks like this:

PATTERN1 PTR1 blah blah blah
needThis  blah blah blah
thisOneAsWell  blah blah blah
PATTERN2

PATTERN1 PTR2 blah blah blah
needThis  blah blah blah
thisOneAsWell  blah blah blah
PATTERN2 

............................
............................

PATTERN1  PTRN blah blah
needThis  blah blah blah
thisOneAsWell blah blah blah
PATTERN2

I need my function to return only the first-column entries from PATTERN1 through PATTERN2, like this:

PTR1
needThis thisOneAsWell

PTR2
needThis thisOneAsWell

......................
......................
PTRN
needThis thisOneAsWell

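PTR1, PTR2 ...... PTRN are different pieces of text. PATTERN1 and PATTERN2 differ from each other but are always present in the file.

How can I achieve this in Python?

I am still a Python beginner. I tried to do this with re.findall() but did not get the desired output: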

import re

def retrieve():
    with open("fileName", "r") as file:
        string = re.findall(r"PATTERN1", file.read())
    print(string)

2 answers:

Answer 0: (score: 0)

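import re

with open('file', 'r') as f:
    content = f.read()

# Non-greedy capture of everything between each PATTERN1 and the next PATTERN2;
# re.DOTALL lets '.' cross line boundaries.
matches = re.findall(r'PATTERN1(.*?)PATTERN2', content, re.MULTILINE|re.DOTALL)

for match in matches:
    for line in match.split('\n'):
        columns = line.split()
        if columns:  # skip blank lines
            print(columns[0])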

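Answer 1: (score: 0)

You can nest two regular expressions: an outer one that isolates each PATTERN1 to PATTERN2 block, and an inner one that picks the first word of each line in that block: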

txt='''\
PATTERN1 PTR1 blah blah blah
needThis1  blah blah blah
thisOneAsWell1  blah blah blah
PATTERN2

PATTERN1 PTR2 blah blah blah
needThis2  blah blah blah
thisOneAsWell2  blah blah blah
PATTERN2 

............................
............................

PATTERN1  PTRN blah blah
needThisN  blah blah blah
thisOneAsWellN blah blah blah
PATTERN2'''

import re

for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
    print(re.findall(r'(^\w+)', m.group(1), re.M))

Prints:

['PTR1', 'needThis1', 'thisOneAsWell1']
['PTR2', 'needThis2', 'thisOneAsWell2']
['PTRN', 'needThisN', 'thisOneAsWellN']

Edit 1

If the file you are working with fits comfortably in memory:

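fn = 'fileName'  # placeholder path; substitute your own input file
with open(fn) as f:
    txt = f.read()
    # Outer regex isolates each block; inner regex takes the first word of each line.
    for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
        print(re.findall(r'(^\w+)', m.group(1), re.M))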

For larger files that do not fit comfortably in memory, use mmap.
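As a rough sketch of the mmap route, the same two regexes can be run over a memory-mapped file. The function names `first_columns_mmap` and `_scan` are made up for this illustration, and the patterns are written as bytes because a memory-mapped file exposes bytes rather than text:

```python
import mmap
import re

# Same nested patterns as above, expressed as bytes patterns.
BLOCK_RE = re.compile(br'^PATTERN1\s*(.*?)(?=^PATTERN2)', re.M | re.S)
FIRST_WORD_RE = re.compile(br'^\w+', re.M)

def _scan(buf):
    """Collect the first-column words of every PATTERN1..PATTERN2 block."""
    results = []
    for m in BLOCK_RE.finditer(buf):
        words = FIRST_WORD_RE.findall(m.group(1))
        # A bytes \w only matches ASCII word characters, so this decode is safe.
        results.append([w.decode('ascii') for w in words])
    return results

def first_columns_mmap(path):
    """Hypothetical helper: scan `path` through a read-only memory map."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # The scan runs in its own function so the regex iterator is
            # released before the map is closed.
            return _scan(mm)
        finally:
            mm.close()
```

Calling `first_columns_mmap(fn)` should give one list per block, in the same shape as the lists printed above.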


Edit 2

Join each block's matches into a single string, then append that string to a list:

with open(fn) as f:
    results = []
    txt = f.read()
    for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
        # Join each block's first-column words into one newline-separated string.
        results.append('\n'.join(re.findall(r'(^\w+)', m.group(1), re.M)))
    print('\n===\n'.join(results))
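To get closer to the exact layout shown in the question (the word after PATTERN1 on its own line, then the remaining first-column words separated by spaces), something along these lines might work. The name `format_blocks` is hypothetical, and the sketch assumes the PATTERN1 line always carries a word after the marker; if it does not, `\s*` will skip to the next line and treat that line's first word as the label.

```python
import re

BLOCK_RE = re.compile(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', re.M | re.S)

def format_blocks(txt):
    """Render each block as its label line followed by the other first-column words."""
    chunks = []
    for m in BLOCK_RE.finditer(txt):
        words = re.findall(r'^\w+', m.group(1), re.M)
        if not words:
            continue  # nothing usable between the two markers
        label, rest = words[0], words[1:]
        chunks.append(label + '\n' + ' '.join(rest))
    # Blank line between blocks, as in the question's sample output.
    return '\n\n'.join(chunks)

sample = '''\
PATTERN1 PTR1 blah blah blah
needThis  blah blah blah
thisOneAsWell  blah blah blah
PATTERN2

PATTERN1 PTR2 blah blah blah
needThis  blah blah blah
thisOneAsWell  blah blah blah
PATTERN2
'''

print(format_blocks(sample))
```

Returning a string rather than printing inside the loop keeps the function easy to reuse, for example to write the result to another file.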