I want to iterate over a text file line by line, search for a pattern, and extract entities from it. However, a few of the records I need to extract span multiple lines, and that information is lost when I iterate line by line.
Right now I am working around it with nested try-except
blocks, appending the next line(s) to the current one, like:
try:
    id_value, utterance, prediction = process(line + ' ' + lines[n + 1])
except AttributeError:
    # Handle bad data
    try:
        id_value, utterance, prediction = process(line + ' ' + lines[n + 1] + ' ' + lines[n + 2])
    except AttributeError:
        # Handle bad data
        try:
            id_value, utterance, prediction = process(
                line + ' ' + lines[n + 1] + ' ' + lines[n + 2] + ' ' + lines[n + 3])
Here is the data:
data.txt
[22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696] (ModelClassification:056) - Model classification for utterance_1 is 1
[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_2
is 1
[22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_3 is 0
As you can see,
[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_2
is 1
spans two lines, which breaks the line-by-line iteration.
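The failure mode can be reproduced in a few lines (a minimal illustration, not from the original post): the regex only matches when the whole record sits on one line.

```python
import re

pattern = r'Model classification for (.*) is (0|1)'

whole = "Model classification for utterance_1 is 1"
split = "Model classification for utterance_2"  # '... is 1' continues on the next line

print(re.search(pattern, whole))  # a match object
print(re.search(pattern, split))  # None, so calling .groups() raises AttributeError
```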
Code
import re

matching_string = 'Model classification for'
id_start_string = '[ID:'
id_end_string = ']'

def process(line):
    start_idx = line.find(id_start_string)
    end_idx = [s.start() for s in re.finditer(id_end_string, line)]
    for end in end_idx:
        if end > start_idx:
            # Get first index greater than start string index
            end_idx = end
            break
    id_value = line[start_idx + len(id_start_string): end_idx]
    groups = re.search('Model classification for (.*) is (0|1)', line).groups()
    utterance = groups[0]
    prediction = groups[1]
    return id_value, utterance, prediction

with open('data.txt', 'r') as f:
    lines = f.read().splitlines()

for n, line in enumerate(lines):
    # Search for pattern in string
    if matching_string in line:
        try:
            id_value, utterance, prediction = process(line)
        except AttributeError:
            print('Bad data')
            print(line)
        print(id_value, utterance, prediction)
Can my problem be solved recursively instead? Any help would be appreciated.
Edit:
lines = ['22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696] (ModelClassification:056) - Model classification for utterance_1 is 1', '[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_2', ' is 1', '[22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_3 is 0 ']
Answer 0 (score: 1)
If you want to find matches anywhere in a file, you can read the whole file and use re.findall() for that:
import re

with open("input.txt", "r") as f:
    text = f.read()

output = re.findall(r'some regex pattern', text)
output1 = re.findall(r'some other pattern', text)
output2 = re.findall(r'another pattern', text)

with open("output.txt", "w") as f:
    # findall returns a list, so join before writing
    f.write('\n'.join(output))
    f.write('\n'.join(output1))
    f.write('\n'.join(output2))
If you want to do this over the whole input at once, re.findall sounds like what you need.
Answer 1 (score: 0)
If you just want the capture to tolerate a line break, you can modify the regex to accept an optional newline (whitespace):
r'Model classification for (.*)\s? is (0|1)'
and run it over the whole file with re.findall.
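A small sketch of that idea on the question's data, read as one string; note that re.DOTALL is needed so '.' can cross the line break, and the lazy '.*?' plus '\s+' here are my adjustments (not from the answer) to keep the captured utterance clean:

```python
import re

# Sample mirroring the question's data.txt, with one record split over two lines.
text = (
    "[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] "
    "(ModelClassification:056) - Model classification for utterance_2\n"
    " is 1\n"
)

# re.DOTALL lets '.' match the newline; '\s+' absorbs the break before 'is'.
matches = re.findall(r'Model classification for (.*?)\s+is (0|1)', text, re.DOTALL)
print(matches)  # [('utterance_2', '1')]
```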
Answer 2 (score: 0)
To answer the question as originally asked (and regardless of what process
actually does), iterate over progressively larger combinations:
value = line
for extra in lines[n + 1:]:
    value = value + " " + extra
    try:
        id_value, utterance, prediction = process(value)
        break
    except AttributeError:
        pass
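That fragment assumes the surrounding loop from the question. A self-contained sketch of the same growing-window idea, using a simplified stand-in for the asker's process() (the real one also extracts the ID), might look like:

```python
import re

def process(text):
    # .groups() raises AttributeError while the record is still incomplete.
    return re.search(r'Model classification for (.*?)\s+is (0|1)', text).groups()

lines = [
    "Model classification for utterance_2",
    " is 1",
    "Model classification for utterance_3 is 0",
]

results = []
for n, line in enumerate(lines):
    if 'Model classification for' not in line:
        continue
    value = line
    # Try the bare line first, then grow it one following line at a time;
    # the trailing '' runs the loop body once more for the final window.
    for extra in lines[n + 1:] + ['']:
        try:
            results.append(process(value))
            break
        except AttributeError:
            value = value + " " + extra

print(results)  # [('utterance_2', '1'), ('utterance_3', '0')]
```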
Answer 3 (score: 0)
I will write my own solution to this problem; I ran into a similar situation in an app. Your sample log will serve as the input.
Let's say we have a file containing logs (I made them a bit more complicated):
[22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696]
(ModelClassification:056) - Mod
el classification for utterance_1 is 1
[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [13992177
1718400] (ModelClassification:056) - Model classificat
ion for utterance_2
is 1
[22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400] (ModelC
lassification:056) - Model classification for utterance_3 is 0
Now, my goal is to collect individual logs. A single log is everything from one line that starts with a date up to the next line that starts with a date. (The file contains many such logs.) Once I have correctly parsed out a single log, I can run the regex over it.
Code:
import re

START_LINE_REGEX = re.compile(r'^\[\d+')
MAIN_MATCHER = re.compile(r'(\[ID:\w+\]).* Model classification for (.*) is (0|1)')


def read_file(file_path):
    """
    Read file from path, and return an iterator over its lines.
    """
    with open(file_path, 'r') as f:
        return iter(f.read().splitlines())


def verify_line(line):
    """
    Check if the line starts a new log entry.
    """
    return True if START_LINE_REGEX.match(line) else False


def single_log(iterator):
    """
    Generator: yield one complete log at a time.
    """
    content = [next(iterator)]
    for line in iterator:
        state = verify_line(line)
        if state:
            yield "".join(content)
            content = [line]
        else:
            content.append(line)
    yield "".join(content)


def get_patterns(log):
    """
    Extract values from a single log (one big line) using the main regex.
    """
    matcher = MAIN_MATCHER.search(log)
    if matcher:
        return matcher.group(1), matcher.group(2), matcher.group(3)
    else:
        print("Could not get groups from '{}'".format(log))


if __name__ == '__main__':
    iterator = read_file('stackoverflow.log')
    gen = single_log(iterator)
    for index, log in enumerate(gen):
        print("{}: {}".format(index, log))
        print("Found regexes: {}".format(get_patterns(log)))
Result:
0: [22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696]
(ModelClassification:056) - Model classification for utterance_1 is 1
Found regexes: ('[ID:9ea1566460506294]', 'utterance_1', '1')
1: [22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400]
(ModelClassification:056) - Model classification for utterance_2 is 1
Found regexes: ('[ID:7ea1566460117776]', ' utterance_2', '1')
2: [22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400]
(ModelClassification:056) - Model classification for utterance_3 is 0
Found regexes: ('[ID:71d1566460492762]', 'utterance_3', '0')
Of course, this depends on how your logs start, but if you refine the regexes, I believe it will serve you better than dancing with indices in a list.