Capturing multiple groups that match the same pattern with re in Python

Time: 2018-05-21 05:30:55

Tags: python regex python-3.x pandas

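I am trying to capture the content between timestamps in a DataFrame column.

The data in the column consists of a timestamp followed by text, then one or more newline characters, followed by more text, and so on.

My goal is to capture all of the text in the column, split by timestamp.

I have been able to capture the first block of text with the pattern below, but I would like to repeat the same approach, or use a better one, to capture the entire text in the column.

My goal is to capture the text highlighted in the attached image.

I used the following pattern and was able to capture the first set of matches.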

pattern=re.compile(r'(^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})(\s[-].*\n)(\D*[.])')

Output: Job transactions done successfully through application transactions.

The text to search is:

  

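1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done successfully through application transactions. 1997-09-01 11:46:22 - EFGH (Additional comments) Case set. Team to follow up with Support for resolution. 1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged. This does not impact application functionality. It was a one off job executed. We will need to discuss this with Team and check the logs to investigate the issue. This should be changed to "low" severity because the job can be re-run at any time of the day.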

2 answers:

Answer 0: (score: 0)

Although this can be done with capture groups, I find it easier to do something like this:

import re

sample="""1997-09-01 12:30:14 - ABCD
Job transactions done successfully through application transactions

1997-09-01 09:15:00 - ABC
Acknowledged

This does not impact functionality.
"""

def date_match(s):
    """Return True if the string starts with a date and time."""
    # raw string, so \d and \s reach the regex engine without escape warnings
    return bool(re.match(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}", s))

def yield_matches(full_log):
    """Yield the text of each log entry, without its date line or blank lines."""
    log = [] # lines of the entry currently being collected
    for line in full_log.split("\n"): # for each line
        if date_match(line): # this line starts a new entry
            if len(log) > 0: # an earlier entry is pending, so emit it
                # drop the first line (the one with the date) and any empty lines
                lines = [l for l in log[1:] if l.strip()]
                yield "\n".join(lines) # yield the finished entry
                log = [] # ... and start collecting the next one

        log.append(line) # add the current line to the entry being collected

    # emit the last entry (no later date line exists to trigger it)
    yield "\n".join([l for l in log[1:] if l.strip()])

logs = list(yield_matches(sample))

for i, l in enumerate(logs, 1):
    print("Match {}:\n{}".format(i, l))

Output:

Match 1:
Job transactions done successfully through application transactions
Match 2:
Acknowledged
This does not impact functionality.

Answer 1: (score: 0)

You can use a positive lookbehind and a positive lookahead to check for the timestamp format:

[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}

Note that matching this format does not guarantee a valid date and time.

(?<=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - ).*?(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - |$)
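As a rough sketch of how the pattern above might be used from Python, the snippet below passes it to `re.findall`. The shortened sample text, the names `TS`, `pattern` and `entries`, and the `re.DOTALL` flag are assumptions added for illustration; the flag is there because the question says the entries can contain newlines, which `.` does not match by default.

```python
import re

# Timestamp shape from the answer; it checks digits only, not calendar validity.
TS = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - "

# The lookbehind and lookahead consume nothing, so each match is only the
# text that sits between one timestamp and the next (or the end of the string).
pattern = re.compile(r"(?<=" + TS + r").*?(?=" + TS + r"|$)", re.DOTALL)

# Shortened, made-up sample in the same layout as the question's data.
text = (
    "1997-09-01 12:30:14 - ABCD Job done. "
    "1997-09-01 11:46:22 - EFGH Case set.\n\nTeam to follow up. "
    "1997-09-01 09:15:00 - ABC Acknowledged."
)

# Strip the whitespace left over around each entry.
entries = [m.strip() for m in pattern.findall(text)]
print(entries)
```

The lookbehind works with Python's `re` module here only because the timestamp pattern has a fixed width; a variable-width lookbehind would be rejected when the pattern is compiled.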

Another option in Python is to split on the timestamp format:

import re
s = """1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done successfully through application transactions. 1997-09-01 11:46:22 - EFGH (Additional comments) Case set. Team to follow up with Support for resolution. 1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged. This does not impact application functionality. It was a one off job executed . We will need to discuss this with Team and check the logs to investigate the issue. This should be changed to 'low' severity because the job can be re-run at any time of the day."""

# In Python 3, filter() returns an iterator, so wrap it in list() to print the pieces
print(list(filter(None, re.split(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - ', s))))

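Splitting this way throws the timestamps away. If they are needed alongside the text, one possible variation is to wrap the timestamp in a capturing group, which makes `re.split` keep it in the result. The sketch below also runs each timestamp through `datetime.strptime`, since the shape check alone accepts impossible dates. The names `SPLIT`, `parse_entries`, `sample` and `result`, and the made-up sample text, are hypothetical and not part of either answer.

```python
import re
from datetime import datetime

# The parentheses form a capturing group, so re.split returns the
# timestamps as well as the text between them.
SPLIT = re.compile(r"([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}) - ")

def parse_entries(text):
    """Return (datetime or None, body) pairs, one per timestamped entry."""
    parts = SPLIT.split(text)
    # parts looks like [prefix, stamp1, body1, stamp2, body2, ...];
    # the prefix before the first timestamp is skipped.
    entries = []
    for stamp, body in zip(parts[1::2], parts[2::2]):
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            when = None  # right shape, but not a real date (e.g. month 90)
        entries.append((when, body.strip()))
    return entries

# Made-up sample; the second timestamp is deliberately invalid.
sample = (
    "1997-09-01 12:30:14 - ABCD Job done. "
    "1997-90-01 09:15:00 - ABC Acknowledged."
)
result = parse_entries(sample)
for when, body in result:
    print(when, "|", body)
```

For a pandas column like the one in the question, a function of this kind could presumably be passed to `Series.apply`, so that each cell becomes a list of (timestamp, text) pairs.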