Finding repeating patterns in text using "re" in Python

Time: 2016-03-01 13:52:17

Tags: python regex

Could someone help me with the example below? (If I use re.DOTALL, the match runs all the way to the end of the file.)

import re

text = "Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\nMax time: 20s"

names = ['A', 'K']
for name in names:
    print name
    print re.findall("Found to {0} from:\n\t\-(.+)".format(name), text)

The text looks like:

Found to A from:
	-B
	-C
Found to K from:
	-B
	-D
	-E
Found to A from:
	-D
Max time: 20s

Output:

A

['B', 'D']

K

['B']

Desired output:

A

['B', 'C', 'D']

K

['B', 'D', 'E']

3 answers:

Answer 0 (score: 4)

Here is another approach (Python 2.7.x):

import re
text = 'Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\nMax time: 20s'
for name in ['A', 'K']:
    print name
    print [ n for i in re.findall('(?:Found to ' + name + ' from:)(?:\\n\\t-([A-Z]))(?:\\n\\t-([A-Z]))?(?:\\n\\t-([A-Z]))?', text) for n in i if n ]

Output:

A
['B', 'C', 'D']
K
['B', 'D', 'E']

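UPDATE: If you don't know in advance how many (?:\n\t-([A-Z])) groups there will be, I suggest the following approach, which first captures the whole block of items after each header and then extracts the letters from each block: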

import re
text = 'Found to A from:\n\t-B\n\t-C\n\t-G\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\nMax time: 20s'
for name in ['A', 'K']:
    print name
    groups = re.findall('(?:Found to ' + name + ' from:)((?:\\n\\s*-(?:[A-Z]))+)', text)
    print reduce(lambda i,j: i + j, map(lambda x: re.findall('\n\s*-([A-Z])', x), groups))

Output:

A
['B', 'C', 'G', 'D']
K
['B', 'D', 'E']
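
For Python 3, where reduce is no longer a builtin, the same two-step idea can be sketched as follows. This is a port under stated assumptions, not the answerer's code: the helper name items_for is made up for illustration, re.escape is added so the name is matched literally, and itertools.chain replaces reduce so that a name with no matches gives an empty list instead of raising an error.

```python
import re
from itertools import chain

text = ('Found to A from:\n\t-B\n\t-C\n\t-G\nFound to K from:\n\t-B\n\t-D\n\t-E\n'
        'Found to A from:\n\t-D\nMax time: 20s')


def items_for(name, text):
    """Return every '-X' item listed under each 'Found to <name> from:' header.

    items_for is a hypothetical helper name, not part of the original answer.
    """
    # Step 1: capture the whole run of '\n<whitespace>-X' lines after each header.
    block_re = r'Found to {} from:((?:\n\s*-[A-Z])+)'.format(re.escape(name))
    blocks = re.findall(block_re, text)
    # Step 2: pull the individual letters out of each captured block and
    # flatten the per-block lists into one list.
    return list(chain.from_iterable(
        re.findall(r'\n\s*-([A-Z])', block) for block in blocks))


for name in ['A', 'K']:
    print(name)
    print(items_for(name, text))
```

Raw strings are used for the patterns so that \s reaches the regex engine unchanged and does not trigger an invalid-escape warning on recent Python versions.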

Answer 1 (score: 2)

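When I was writing this answer, I was trying to answer your original question, where you had a file with specific content to parse. I think my answer still applies. If you have a string instead, change

for line in f:

to

for line in f.splitlines():

and pass the string instead of the file object to keys_and_values.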
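
One caveat when adapting the generator to a string, offered as a sketch and not as the answerer's code: the two loops in keys_and_values rely on f remembering its position between them. A file object does that, but two separate splitlines() calls each start again from the first line. The version below wraps the lines in a single iterator so the second loop resumes where the first stopped. The name keys_and_values_from_text, the capture group used to read the key, and the stripping of the leading "-" are all assumptions added for illustration.

```python
import re
from collections import OrderedDict

# Capture the key letter directly instead of indexing into the line.
HEADER = re.compile(r'^\s*Found to ([A-Z]) from:\s*$')


def keys_and_values_from_text(text):
    """Yield (key, value) pairs from a string (hypothetical helper name)."""
    # One shared iterator: the second loop continues after the first header,
    # just as it would with a file object.
    lines = iter(text.splitlines())

    # Discard anything before the first header.
    key = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            key = match.group(1)
            break

    # Yield (key, value) tuples, switching key at each new header.
    for line in lines:
        match = HEADER.match(line)
        if match:
            key = match.group(1)
        elif line.strip():
            # Assumption: drop the leading '-' used in the question's string.
            yield key, line.strip().lstrip('-')


text = ("some header line\n"
        "Found to A from:\n\t-B\n\t-C\n"
        "Found to K from:\n\t-B\n\t-D\n\t-E\n"
        "Found to A from:\n\t-D\n")

result = OrderedDict()
for k, v in keys_and_values_from_text(text):
    result.setdefault(k, []).append(v)

for k in result:
    print('{}\n{}\n'.format(k, result[k]))
```

As in the file-based version, any non-empty line after the first header that is not itself a header is treated as a value, so a trailer such as "Max time: 20s" would need to be filtered out separately.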

Original answer:

Honestly, I think this looks like a task where the heavy lifting should be done by a generator, with a little help from regular expressions.

import re
from collections import OrderedDict

def keys_and_values(f):
    # discard any headers
    target = r'^\s*Found to [A-Z] from:\s*$'
    for line in f:
        if re.match(target, line.strip()):
            break

    # yield (key, value) tuples
    key = line.strip()[9]
    for line in f:
        line = line.strip()
        if re.match(target, line):
            key = line[9]
        elif line:
            yield (key, line)

result = OrderedDict()
with open('testfile.txt') as f:
    for k,v in keys_and_values(f):
        result.setdefault(k, []).append(v)

for k in result:
    print('{}\n{}\n'.format(k, result[k]))

Demo:

$ cat testfile.txt 
some
useless
header
lines

Found to A from:

B

C

Found to K from:

B

D

E

Found to A from:

D
$ python parsefile.py
A
['B', 'C', 'D']

K
['B', 'D', 'E']

Answer 2 (score: 0)

Not generic, but it works for your case, is simple, and uses findall as you mentioned.

import re

text = "Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\n"

names = ['A', 'K']
for name in names:
    print name
    test = re.findall("Found to {0} from:\n\t-([A-Z])(\n\t)?-?([A-Z])?(\n\t)?-?([A-Z])?".format(name), text)
    # normalize it
    prettyList = []
    for (a,b,c,d,e) in test:
        prettyList.append(a)
        prettyList.append(c)
        prettyList.append(e)
    print [x for x in prettyList if x]

Output:

A
['B', 'C', 'D']
K
['B', 'D', 'E']

I know there may be cases with more than 3 elements; for those you would have to add extra matches to the pattern.
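
If the three-item limit is a concern, one possible way around it is to capture the whole run of item lines in a single repeated group and split it afterwards. The sketch below is an assumption-laden illustration, not part of the answer above: the function name group_items is made up, it assumes every item is a single uppercase letter on a line of the form "\n\t-X", and it collects all header letters in one pass instead of looping over names.

```python
import re
from collections import defaultdict

text = ("Found to A from:\n\t-B\n\t-C\n\t-F\n\t-G\n"
        "Found to K from:\n\t-B\n\t-D\n\t-E\n"
        "Found to A from:\n\t-D\n")


def group_items(text):
    """Map each header letter to all items listed under it (hypothetical helper)."""
    grouped = defaultdict(list)
    # Group 1: the header letter.
    # Group 2: the whole run of '\n\t-X' lines, however many there are.
    pattern = r'Found to ([A-Z]) from:((?:\n\t-[A-Z])*)'
    for name, block in re.findall(pattern, text):
        # A block looks like '\n\t-B\n\t-C'; splitting on the separator leaves
        # an empty first piece, which the slice discards.
        grouped[name].extend(block.split('\n\t-')[1:])
    return dict(grouped)


grouped = group_items(text)
for name in ['A', 'K']:
    print(name)
    print(grouped.get(name, []))
```

Because the repeated group uses *, a header with no items still appears in the result with an empty list.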