Question

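I am trying to capture the content between timestamps in a DataFrame column.

Each value in the column consists of a timestamp followed by text, then one or more newline characters, followed by more text, and so on.

My goal is to capture all of the text in the column, separated at each timestamp.

I have been able to capture the first block of text with the pattern below, but I would like to repeat the same approach (or a better one) to capture all of the text in the column.

The text I want to capture is highlighted in the attached image.

I used the following pattern and was able to capture the first match.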

pattern=re.compile(r'(^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})(\s[-].*\n)(\D*[.])')

Output: Job transactions done successfully through application transactions.

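The text to search is:

1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done successfully through application transactions. 1997-09-01 11:46:22 - EFGH (Additional comments) Case set. Team to follow up with Support for resolution. 1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged. This does not impact application functionality. It was a one off job executed . We will need to discuss this with Team and check the logs to investigate the issue. This should be changed to 'low' severity because the job can be re-run at any time of the day.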

Answer 1

Although this can be done with capture groups, I find it easier to do something like this:

import re

sample="""1997-09-01 12:30:14 - ABCD
Job transactions done successfully through application transactions

1997-09-01 09:15:00 - ABC
Acknowledged

This does not impact functionality.
"""

def date_match(s):
    """Return True if the beginning of this string matches a date and time."""
    # raw string so the regex escapes (\d, \s) are not treated as string escapes
    return bool(re.match(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}", s))

def yield_matches(full_log):
    log = []  # lines of the entry currently being collected
    for line in full_log.split("\n"):  # for each line
        if date_match(line):  # a line starting with a date begins a new entry
            if len(log) > 0:  # if there's already an entry in progress...
                # drop its first line (the one with the date) and any empty lines
                lines = [l for l in log[1:] if l.strip()]
                yield "\n".join(lines)  # yield the finished entry
                log = []  # ...and start collecting the next one

        log.append(line)  # add the current line to the entry

    # yield the last entry (there's no date after it to close it off)
    yield "\n".join([l for l in log[1:] if l.strip()])

logs = list(yield_matches(sample))

for i, l in enumerate(logs, 1):
    print("Match {}:\n{}".format(i, l))

Output:

Match 1:
Job transactions done successfully through application transactions
Match 2:
Acknowledged
This does not impact functionality.

Answer 2

You could use a positive lookbehind and a positive lookahead that check for the timestamp format:

[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}

Note that matching this format does not guarantee a valid date and time.

(?<=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - ).*?(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - |$)
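
As a rough illustration of how the lookaround pattern might be used from Python, here is a minimal sketch. The names TS, ENTRY and extract_entries are made up for this example, and the sample is a shortened version of the text in the question. The re.DOTALL flag is an addition of mine, on the assumption that an entry can span several lines, since `.` does not match a newline by default.

```python
import re

# Timestamp prefix such as "1997-09-01 12:30:14 - ". It has a fixed width,
# which Python's re module requires for anything inside a lookbehind.
TS = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - "

# Lazily match everything after one timestamp up to the next one, or the end.
# re.DOTALL is an assumption: it lets "." cross newlines inside an entry.
ENTRY = re.compile(r"(?<=" + TS + r").*?(?=" + TS + r"|$)", re.DOTALL)

def extract_entries(text):
    """Return the text found after each timestamp, stripped of whitespace."""
    return [m.strip() for m in ENTRY.findall(text)]

# Shortened sample based on the text in the question (hypothetical layout).
sample = (
    "1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done.\n\n"
    "1997-09-01 11:46:22 - EFGH (Additional comments) Case set. "
    "1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged."
)

for entry in extract_entries(sample):
    print(entry)
```

Because the pattern contains no capturing groups, findall returns the whole match for each entry, so nothing needs to be unpacked afterwards.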

Another option in Python could be to split on the timestamp format:

import re
s = """1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done successfully through application transactions. 1997-09-01 11:46:22 - EFGH (Additional comments) Case set. Team to follow up with Support for resolution. 1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged. This does not impact application functionality. It was a one off job executed . We will need to discuss this with Team and check the logs to investigate the issue. This should be changed to 'low' severity because the job can be re-run at any time of the day."""

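# filter(None, ...) drops the empty string before the first timestamp;
# list() is needed on Python 3, where filter returns a lazy iterator
print(list(filter(None, re.split(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - ', s))))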
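
Splitting this way throws the timestamps away. If they are needed alongside the text, one possible variation (a sketch only, with the hypothetical names TS_SPLIT and split_with_timestamps) is to wrap the timestamp in a capturing group, since re.split keeps captured text in its result. Parsing each captured timestamp with datetime.strptime also covers the earlier caveat that a string in the right format is not necessarily a valid date.

```python
import re
from datetime import datetime

# The parentheses make re.split keep the matched timestamp in its result
# instead of discarding it; the trailing " - " stays outside the group.
TS_SPLIT = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}) - "
)

def split_with_timestamps(text):
    """Return (datetime, text) pairs, one per timestamped entry."""
    parts = TS_SPLIT.split(text)
    # parts[0] is whatever precedes the first timestamp (usually empty);
    # after that, timestamps and entry texts alternate.
    pairs = []
    for stamp, body in zip(parts[1::2], parts[2::2]):
        # strptime raises ValueError for impossible dates such as month 90.
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        pairs.append((when, body.strip()))
    return pairs

# Shortened sample based on the text in the question.
sample = (
    "1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done. "
    "1997-09-01 11:46:22 - EFGH (Additional comments) Case set."
)

for when, body in split_with_timestamps(sample):
    print(when.isoformat(), "|", body)
```

Letting the ValueError propagate is a design choice for this sketch; catching it and skipping or logging the entry would work equally well if malformed timestamps are expected in the data.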

