How do I get the text from one match up to the next match of the same pattern?
I have a log file like this:
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
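I can find the first two lines of each block, but I can't get the remaining lines up to the next match. So I only get: INFO1: BLAH INFO2: BLAH
But I want to extract groups like this: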
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
I tried this:
import re

start_exec_ptrn = r'INFO1: .+\nINFO2: .+'
last_exec_start = last_exec_end = 0
for m in re.finditer(start_exec_ptrn, log_content):
    start_exec = m.start()
    end_exec = m.end()
    print(start_exec, '-', end_exec)
    print(log_content[last_exec_end:end_exec])
    last_exec_start = start_exec
    last_exec_end = end_exec
    print(150 * '*')
Thanks in advance, and sorry for my English!
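Answer 0: (score: 1)
Try the following, where all_text holds the contents of the log file:
>>> import re
>>> separator = "INFO1: BLAH\nINFO2: BLAH\n"
>>> # Split on the literal separator only; a trailing '.*' would also swallow the first line of each body.
>>> [separator + p for p in re.split(re.escape(separator), all_text)[1:]]
This returns exactly what you are looking for: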
['INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATED TO THE INFO1 AND INFO2\nSOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2\nSOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2\nSOMETHING ALSO RELATED TO THE INFO1 AND INFO2\n',
 'INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATED TO THE INFO1 AND INFO2\nSOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2\nSOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2\nSOMETHING ALSO RELATED TO THE INFO1 AND INFO2\n',
 'INFO1: BLAH\nINFO2: BLAH\nSOMETHING RELATED TO THE INFO1 AND INFO2\nSOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2\nSOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2\nSOMETHING ALSO RELATED TO THE INFO1 AND INFO2\n']
Answer 1: (score: 0)
Answer 2: (score: 0)
You can do this without regular expressions:
with open('file.log') as f:
    data = f.read().splitlines()  # lines without their trailing newlines

matches, headers, sec = [], [], []
for i, line in enumerate(data):
    if not line:
        continue
    line_lower = line.lower()
    if line_lower.startswith('info'):
        # A header line that does not follow another header starts a new section.
        if i == 0 or not data[i - 1].lower().startswith('info'):
            if headers and sec:
                matches.append({'headers': headers, 'matches': sec})
            headers, sec = [], []
        head = line_lower.split(':')[0]
        headers.append(head)
        continue
    if any(x in line_lower for x in headers):
        sec.append(line)
# Store the last section, which the loop never closes on its own.
if headers and sec:
    matches.append({'headers': headers, 'matches': sec})
print(matches)
# For the log above this prints a list of three dicts, one per section, each of the form:
# {'headers': ['info1', 'info2'], 'matches': ['SOMETHING RELATED TO THE INFO1 AND INFO2', 'SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2', 'SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2', 'SOMETHING ALSO RELATED TO THE INFO1 AND INFO2']}
Answer 3: (score: 0)
To retrieve all lines containing INFO1 or INFO2, the regex pattern should be:
^.*\b(INFO1|INFO2)\b.*$
Hope that helps!
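For reference, a small sketch of how that pattern could be applied from Python. It assumes the whole log is held in one string, which is why re.MULTILINE is needed; the sample text is made up:

```python
import re

# The pattern from the answer; re.MULTILINE makes ^ and $ match at every line.
LINE_PATTERN = re.compile(r'^.*\b(INFO1|INFO2)\b.*$', re.MULTILINE)

# Hypothetical sample text.
sample = "INFO1: BLAH\nan unrelated line\nSOMETHING RELATED TO THE INFO2\n"

# group(0) is the whole matched line; findall() would return only the
# captured keyword because the pattern contains a capturing group.
matching_lines = [m.group(0) for m in LINE_PATTERN.finditer(sample)]
print(matching_lines)
```

Note that this selects individual lines; it does not group them into sections.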
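Answer 4: (score: 0)
How about using split()?
Assuming your text is assigned to string, you can do:
separator = "INFO1: BLAH\nINFO2: BLAH"
# split() drops the separator, so put it back in front of each section;
# [1:] skips the empty string before the first separator.
for section in string.split(separator)[1:]:
    print('{0}{1}'.format(separator, section))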
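Answer 5: (score: 0)
If the sections always start with INFO, you can use groupby:
from itertools import groupby

with open("in.txt") as f:
    # Consecutive lines are grouped by whether they are header lines (key is True) or not.
    grps = groupby(f, key=lambda x: x.startswith(("INFO1:", "INFO2:")))
    for k, v in grps:
        if k:
            # Join the header lines, then pull the next group (the body) and join it too.
            print("".join(v) + "".join(next(grps, ["", ""])[1]))
Output: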
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
INFO1: BLAH
INFO2: BLAH
SOMETHING RELATED TO THE INFO1 AND INFO2
SOMETHING DIFFERENT RELATED TO THE INFO1 AND INFO2
SOMETHING OTHER WAY RELATED TO THE INFO1 AND INFO2
SOMETHING ALSO RELATED TO THE INFO1 AND INFO2
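If you would rather collect the sections than print them, a plain loop also works. This is a sketch of a different technique from groupby, assuming every section starts with a line beginning with "INFO1:"; the name iter_sections and the sample lines are hypothetical:

```python
def iter_sections(lines, start_prefix="INFO1:"):
    """Yield each section as one string, starting at a line with start_prefix."""
    section = None  # None until the first header line has been seen
    for line in lines:
        if line.startswith(start_prefix):
            if section is not None:
                yield "".join(section)
            section = [line]
        elif section is not None:
            section.append(line)
    # Emit the final section, which has no following header to close it.
    if section is not None:
        yield "".join(section)


# Hypothetical sample; an open file object can be passed in the same way.
sample_lines = [
    "INFO1: A\n", "INFO2: B\n", "first body line\n",
    "INFO1: C\n", "INFO2: D\n", "second body line\n",
]
sections = list(iter_sections(sample_lines))
print(sections)
```

Lines that appear before the first header are skipped.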
Answer 6: (score: -1)
You should look at calling findall() on the string:
import re

## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)
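Applied to the log from the question, findall() might look like the sketch below. It assumes every line ends with a newline and that a section runs until the next line starting with "INFO1: "; the negative lookahead and the sample text are illustrative additions, not part of the answer above:

```python
import re

# Two header lines, then every following line that does not start a new header.
SECTION = re.compile(
    r'^INFO1: .+\nINFO2: .+\n(?:(?!INFO1: ).*\n)*',
    re.MULTILINE,
)

# Hypothetical sample text.
sample = (
    "INFO1: A\nINFO2: B\nfirst body line\nsecond body line\n"
    "INFO1: C\nINFO2: D\nthird body line\n"
)

# The pattern has no capturing groups, so findall() returns the full matches.
sections = SECTION.findall(sample)
for section in sections:
    print(section)
```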