I want to iterate over a text file line by line, search for a pattern, and extract entities from it. However, a few of the records I need to extract span multiple lines, and that information is lost when I iterate line by line.
Right now I am working around it with nested try-except
blocks, appending the next line(s) to the current one, like:
try:
    id_value, utterance, prediction = process(line + ' ' + lines[n + 1])
except AttributeError:
    # Handle bad data
    try:
        id_value, utterance, prediction = process(line + ' ' + lines[n + 1] + ' ' + lines[n + 2])
    except AttributeError:
        # Handle bad data
        try:
            id_value, utterance, prediction = process(
                line + ' ' + lines[n + 1] + ' ' + lines[n + 2] + ' ' + lines[n + 3])
Here is the data:
data.txt
[22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696] (ModelClassification:056) - Model classification for utterance_1 is 1
[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_2
is 1
[22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_3 is 0
As you can see,
[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_2
is 1
spans two lines, which breaks the line-by-line iteration.
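The failure mode can be reproduced in a few lines (a minimal illustration, not from the original post): the regex only matches when the whole record sits on one line.

```python
import re

pattern = r'Model classification for (.*) is (0|1)'

whole = "Model classification for utterance_1 is 1"
split = "Model classification for utterance_2"  # '... is 1' continues on the next line

print(re.search(pattern, whole))  # a match object
print(re.search(pattern, split))  # None, so calling .groups() raises AttributeError
```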
Code
import re

matching_string = 'Model classification for'
id_start_string = '[ID:'
id_end_string = ']'

def process(line):
    start_idx = line.find(id_start_string)
    end_idx = [s.start() for s in re.finditer(id_end_string, line)]
    for end in end_idx:
        if end > start_idx:
            # Get first index greater than start string index
            end_idx = end
            break
    id_value = line[start_idx + len(id_start_string): end_idx]
    groups = re.search('Model classification for (.*) is (0|1)', line).groups()
    utterance = groups[0]
    prediction = groups[1]
    return id_value, utterance, prediction

with open('data.txt', 'r') as f:
    lines = f.read().splitlines()

for n, line in enumerate(lines):
    # Search for pattern in string
    if matching_string in line:
        try:
            id_value, utterance, prediction = process(line)
        except AttributeError:
            print('Bad data')
            print(line)
        print(id_value, utterance, prediction)
Can my problem be solved recursively instead? Any help would be appreciated.
Edit:
lines = ['22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696] (ModelClassification:056) - Model classification for utterance_1 is 1', '[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_2', ' is 1', '[22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400] (ModelClassification:056) - Model classification for utterance_3 is 0 ']
Answer 0 (score: 1)
If you want to find matches anywhere in a file, you can read the whole file and use re.findall() for that:
import re

with open("input.txt", "r") as f:
    text = f.read()

output = re.findall(r'some regex pattern', text)
output1 = re.findall(r'some other pattern', text)
output2 = re.findall(r'another pattern', text)

with open("output.txt", "w") as f:
    # findall returns a list, so join before writing
    f.write('\n'.join(output))
    f.write('\n'.join(output1))
    f.write('\n'.join(output2))
If you want to do this over the whole input at once, re.findall sounds like what you need.
Answer 1 (score: 0)
If you just want the capture to tolerate a line break, you can modify the regex to accept an optional newline (whitespace):
r'Model classification for (.*)\s? is (0|1)'
and run it over the whole file with re.findall.
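A small sketch of that idea on the question's data, read as one string; note that re.DOTALL is needed so '.' can cross the line break, and the lazy '.*?' plus '\s+' here are my adjustments (not from the answer) to keep the captured utterance clean:

```python
import re

# Sample mirroring the question's data.txt, with one record split over two lines.
text = (
    "[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400] "
    "(ModelClassification:056) - Model classification for utterance_2\n"
    " is 1\n"
)

# re.DOTALL lets '.' match the newline; '\s+' absorbs the break before 'is'.
matches = re.findall(r'Model classification for (.*?)\s+is (0|1)', text, re.DOTALL)
print(matches)  # [('utterance_2', '1')]
```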
Answer 2 (score: 0)
To answer the question as originally asked (and regardless of what process
actually does), iterate over progressively larger combinations:
value = line
for extra in lines[n + 1:]:
    value = value + " " + extra
    try:
        id_value, utterance, prediction = process(value)
        break
    except AttributeError:
        pass
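That fragment assumes the surrounding loop from the question. A self-contained sketch of the same growing-window idea, using a simplified stand-in for the asker's process() (the real one also extracts the ID), might look like:

```python
import re

def process(text):
    # .groups() raises AttributeError while the record is still incomplete.
    return re.search(r'Model classification for (.*?)\s+is (0|1)', text).groups()

lines = [
    "Model classification for utterance_2",
    " is 1",
    "Model classification for utterance_3 is 0",
]

results = []
for n, line in enumerate(lines):
    if 'Model classification for' not in line:
        continue
    value = line
    # Try the bare line first, then grow it one following line at a time;
    # the trailing '' runs the loop body once more for the final window.
    for extra in lines[n + 1:] + ['']:
        try:
            results.append(process(value))
            break
        except AttributeError:
            value = value + " " + extra

print(results)  # [('utterance_2', '1'), ('utterance_3', '0')]
```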
Answer 3 (score: 0)
I will write my own solution to this problem; I ran into a similar situation in an app. Your sample log will serve as the input.
Let's say we have a file containing logs (I made them a bit more complicated):
[22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696]
(ModelClassification:056) - Mod
el classification for utterance_1 is 1
[22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [13992177
1718400] (ModelClassification:056) - Model classificat
ion for utterance_2
is 1
[22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400] (ModelC
lassification:056) - Model classification for utterance_3 is 0
Now, my goal is to collect individual logs. A single log is everything from one line that starts with a date up to the next line that starts with a date. (The file contains many such logs.) Once I have correctly parsed out a single log, I can run the regex over it.
Code:
import re

START_LINE_REGEX = re.compile(r'^\[\d+')
MAIN_MATCHER = re.compile(r'(\[ID:\w+\]).* Model classification for (.*) is (0|1)')


def read_file(file_path):
    """
    Read file from path, and return an iterator over its lines.
    """
    with open(file_path, 'r') as f:
        return iter(f.read().splitlines())


def verify_line(line):
    """
    Check if the line starts a new log entry.
    """
    return True if START_LINE_REGEX.match(line) else False


def single_log(iterator):
    """
    Generator: yield one complete log at a time.
    """
    content = [next(iterator)]
    for line in iterator:
        state = verify_line(line)
        if state:
            yield "".join(content)
            content = [line]
        else:
            content.append(line)
    yield "".join(content)


def get_patterns(log):
    """
    Extract values from a single log (one big line) using the main regex.
    """
    matcher = MAIN_MATCHER.search(log)
    if matcher:
        return matcher.group(1), matcher.group(2), matcher.group(3)
    else:
        print("Could not get groups from '{}'".format(log))


if __name__ == '__main__':
    iterator = read_file('stackoverflow.log')
    gen = single_log(iterator)
    for index, log in enumerate(gen):
        print("{}: {}".format(index, log))
        print("Found regexes: {}".format(get_patterns(log)))
Result:
0: [22 Aug 2019 13:25:12] [ID:9ea1566460506294] INFO [139921763325696]
(ModelClassification:056) - Model classification for utterance_1 is 1
Found regexes: ('[ID:9ea1566460506294]', 'utterance_1', '1')
1: [22 Aug 2019 13:26:06] [ID:7ea1566460117776] INFO [139921771718400]
(ModelClassification:056) - Model classification for utterance_2 is 1
Found regexes: ('[ID:7ea1566460117776]', ' utterance_2', '1')
2: [22 Aug 2019 13:26:16] [ID:71d1566460492762] INFO [139921771718400]
(ModelClassification:056) - Model classification for utterance_3 is 0
Found regexes: ('[ID:71d1566460492762]', 'utterance_3', '0')
Of course, this depends on how your logs start, but if you refine the regexes, I believe it will serve you better than dancing with indices in a list.