Python: run try recursively until the condition is met

Time: 2019-08-25 12:46:34

Tags: python list csv text

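I want to iterate over a text file line by line, search each line for a pattern, and extract entities from it. However, some of the patterns I need to extract span multiple lines, and those are lost when I iterate line by line.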

Right now I am using nested try-except blocks and appending the next line to the current one, for example:

try:
    id_value, utterance, prediction = process(line + ' ' + lines[n + 1])
except AttributeError:
    # Handle bad data
    try:
        id_value, utterance, prediction = process(line + ' ' + lines[n + 1] + ' ' + lines[n + 2])
    except AttributeError:
        # Handle bad data
        try:
            id_value, utterance, prediction = process(
                line + ' ' + lines[n + 1] + ' ' + lines[n + 2] + ' ' + lines[n + 3])
        except AttributeError:
            # ...and so on, one nesting level per extra line
            pass

Here is the data:

data.txt

[22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] (ModelClassification:056) - Model classification for utterance_1 is 1
[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [139921771718400] (ModelClassification:056) - Model classification for  utterance_2
 is 1
[22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_3 is 0 

As you can see, the record

[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [139921771718400] (ModelClassification:056) - Model classification for  utterance_2
 is 1

spans two lines, so it is broken apart when iterating line by line.

Code:

import re

matching_string = 'Model classification for'
id_start_string = '[ID:'
id_end_string = ']'


def process(line):
    start_idx = line.find(id_start_string)
    end_idx = [s.start() for s in re.finditer(id_end_string, line)]
    for end in end_idx:
        if end > start_idx:
            # Get first index greater than start string index
            end_idx = end
            break
    id_value = line[start_idx + len(id_start_string): end_idx]
    groups = re.search('Model classification for (.*) is (0|1)', line).groups()
    utterance = groups[0]
    prediction = groups[1]
    return id_value, utterance, prediction


with open('data.txt', 'r') as f:
    lines = f.read().splitlines()
    for n, line in enumerate(lines):
        # Search for pattern in string
        if matching_string in line:
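            try:
                id_value, utterance, prediction = process(line)
            except AttributeError:
                # The pattern was not found on this single line.
                print('Bad data')
                print(line)
            else:
                # Only print the values when process() succeeded.
                print(id_value, utterance, prediction)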

Is there a recursive solution to my problem? Any help would be appreciated.

Edit:

lines = ['22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] (ModelClassification:056) - Model classification for utterance_1 is 1', '[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [139921771718400] (ModelClassification:056) - Model classification for  utterance_2', ' is 1', '[22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_3 is 0 ']

4 Answers:

Answer 0 (score: 1)

If you want to find a pattern in a file, you can use re.findall() for this:

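import re

# Read the whole file at once so a pattern can match across line breaks.
with open("input.txt", "r") as f:
    text = f.read()

output = re.findall(r'some regex pattern', text)
output1 = re.findall(r'some other pattern', text)
output2 = re.findall(r'another pattern', text)

# re.findall returns a list, so join the matches before writing them out.
with open("output.txt", "w") as f:
    for matches in (output, output1, output2):
        f.write("\n".join(matches) + "\n")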

If you were planning to do this recursively, re.findall sounds like what you actually need.

Answer 1 (score: 0)

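If you only need to capture matches that contain a line break, you can modify the regex to accept an optional newline (whitespace) character: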

r'Model classification for (.*)\s? is (0|1)'

Then run it over the whole file with re.findall.
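
A minimal sketch of that suggestion, run against the sample data from the question. The `\[ID:(\w+)\]` prefix is an assumption added here and is not part of this answer's regex; it is included only so that the ID comes back together with the utterance and the prediction.

```python
import re

# Sample text copied from the question; the second record spans two lines.
text = (
    "[22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] "
    "(ModelClassification:056) - Model classification for utterance_1 is 1\n"
    "[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [139921771718400] "
    "(ModelClassification:056) - Model classification for  utterance_2\n"
    " is 1\n"
    "[22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400] "
    "(ModelClassification:056) - Model classification for utterance_3 is 0 \n"
)

# The answer's regex, with an assumed "[ID:...]" prefix added to capture the ID too.
PATTERN = re.compile(r'\[ID:(\w+)\].*?Model classification for (.*)\s? is (0|1)')

# Strip the utterance because the sample data has extra spaces around it.
matches = [(id_value, utterance.strip(), prediction)
           for id_value, utterance, prediction in PATTERN.findall(text)]

for match in matches:
    print(match)
```

Note that `\s?` only tolerates a single line break directly before " is"; a record broken at some other position would still be missed by this pattern.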

Answer 2 (score: 0)

To answer the original question (leaving aside what process actually does), iterate over progressively larger combinations of lines:

value = line
for extra in lines[n+1:]:
    value = value + " " + extra
    try:
        id_value, utterance, prediction = process(value)
        break
    except AttributeError:
        pass
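
The loop above can be wrapped into a helper so it drops into the question's main loop. This is only a sketch: `process` below is a simplified stand-in for the question's function, and `process_with_lookahead` and its `max_extra` limit are hypothetical names introduced here for illustration.

```python
import re

# Simplified stand-in for the question's process(); like the original, it
# raises AttributeError when the pattern is not found (re.search returns None).
PATTERN = re.compile(r'\[ID:(\w+)\].*Model classification for (.*) is (0|1)')


def process(line):
    id_value, utterance, prediction = PATTERN.search(line).groups()
    return id_value, utterance.strip(), prediction


def process_with_lookahead(lines, n, max_extra=3):
    """Try lines[n] alone, then with up to max_extra following lines appended."""
    value = lines[n]
    for k in range(max_extra + 1):
        try:
            return process(value)
        except AttributeError:
            nxt = n + k + 1
            if k == max_extra or nxt >= len(lines):
                return None  # give up: no match within the allowed window
            value = value + ' ' + lines[nxt]
    return None


lines = [
    '[22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] '
    '(ModelClassification:056) - Model classification for utterance_1 is 1',
    '[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [139921771718400] '
    '(ModelClassification:056) - Model classification for  utterance_2',
    ' is 1',
    '[22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400] '
    '(ModelClassification:056) - Model classification for utterance_3 is 0 ',
]

matching_string = 'Model classification for'
results = [process_with_lookahead(lines, n)
           for n, line in enumerate(lines) if matching_string in line]
print(results)
```

Capping the lookahead matters: without a limit, a genuinely malformed record keeps absorbing the lines that follow it and may end up reporting values that belong to a later record.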

Answer 3 (score: 0)

I'll write up my own solution to this problem, since I ran into a similar situation in my application. I'll use your sample logs as input.

Let's say we have a file containing logs (I made them a bit more complicated):

[22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] 
(ModelClassification:056) - Mod
el classification for utterance_1 is 1
[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [13992177
1718400] (ModelClassification:056) - Model classificat
ion for  utterance_2
 is 1
[22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400] (ModelC
lassification:056) - Model classification for utterance_3 is 0

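My goal is to collect single logs. A single log is everything from a line that starts with a date up to, but not including, the next line that starts with a date. (The file contains many single logs.) Once a single log has been assembled correctly, I can run the regex on it.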

Code:

import re

START_LINE_REGEX = re.compile(r'^\[\d+')
MAIN_MATCHER = re.compile(r'(\[ID:\w+\]).* Model classification for (.*) is (0|1)')

def read_file(file_path):
    """
    Read file from path, and return iterator.
    """
    with open(file_path, 'r') as f:
        return iter(f.read().splitlines())

def verify_line(line):
    """
    Check whether the line starts a new log entry.
    """
    return bool(START_LINE_REGEX.match(line))

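def single_log(iterator):
    """
    Generator: group physical lines into single logs.
    """
    first = next(iterator, None)
    if first is None:
        # Empty input: nothing to yield.
        return
    content = [first]
    for line in iterator:
        if verify_line(line):
            # A new log starts here, so emit the one collected so far.
            yield "".join(content)
            content = [line]
        else:
            content.append(line)
    yield "".join(content)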

def get_patterns(log):
    """
    Extract the ID, utterance and prediction from a single log
    (already joined into one long line). Return None if nothing matches.
    """
    matcher = MAIN_MATCHER.search(log)
    if matcher:
        return matcher.group(1), matcher.group(2), matcher.group(3)
    print("Could not get groups from '{}'".format(log))
    return None


if __name__ == '__main__':
    iterator = read_file('stackoverflow.log')

    gen = single_log(iterator)
    for index, log in enumerate(gen):
        print("{}: {}".format(index, log))
        print("Found regexes: {}".format(get_patterns(log)))

Result:

0: [22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] 
(ModelClassification:056) - Model classification for utterance_1 is 1
Found regexes: ('[ID:9ea1566460506294]', 'utterance_1', '1')
1: [22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [139921771718400]         
(ModelClassification:056) - Model classification for  utterance_2 is 1
Found regexes: ('[ID:7ea1566460117776]', ' utterance_2', '1')
2: [22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400]         
(ModelClassification:056) - Model classification for utterance_3 is 0
Found regexes: ('[ID:71d1566460492762]', 'utterance_3', '0')

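Of course, this depends on how each log entry starts, but if you refine the regex, I believe it is more worthwhile than juggling indexes in a list.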
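
As one way to refine the regex, here is a compact sketch of the same grouping idea that uses re.split with a stricter record-start pattern. It assumes every record begins with a timestamp shaped like "[22 Aug 2019 ...", so a continuation line that merely starts with "[" followed by digits is not mistaken for a new record. `RECORD_START`, `MATCHER`, `split_records` and `parse` are hypothetical names, and `MATCHER` is the answer's regex with the capture group narrowed to the ID itself.

```python
import re

# Assumption: a new record starts right after a newline, with "[DD Mon YYYY ".
RECORD_START = re.compile(r'\n(?=\[\d{1,2} \w{3} \d{4} )')
MATCHER = re.compile(r'\[ID:(\w+)\].* Model classification for (.*) is (0|1)')

# The "complicated" sample log from this answer.
text = (
    "[22 Aug 2019 13:25:12] [ID:9ea1566460506294]     INFO [139921763325696] \n"
    "(ModelClassification:056) - Mod\n"
    "el classification for utterance_1 is 1\n"
    "[22 Aug 2019 13:26:06] [ID:7ea1566460117776]     INFO [13992177\n"
    "1718400] (ModelClassification:056) - Model classificat\n"
    "ion for  utterance_2\n"
    " is 1\n"
    "[22 Aug 2019 13:26:16] [ID:71d1566460492762]     INFO [139921771718400] (ModelC\n"
    "lassification:056) - Model classification for utterance_3 is 0\n"
)


def split_records(text):
    """Split on record starts, then glue each record's pieces back together."""
    return [chunk.replace('\n', '')
            for chunk in RECORD_START.split(text) if chunk.strip()]


def parse(text):
    """Return (id, utterance, prediction) for every record that matches."""
    parsed = []
    for record in split_records(text):
        match = MATCHER.search(record)
        if match:
            id_value, utterance, prediction = match.groups()
            parsed.append((id_value, utterance.strip(), prediction))
    return parsed


print(parse(text))
```

The pieces are joined with an empty string because this sample breaks lines in the middle of words; for the question's original data.txt, joining with a space would work just as well.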