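I am trying to capture the content between timestamps in a DataFrame column.
The data in the column consists of a timestamp followed by text, then one or more newline characters, followed by more text, and so on.
My goal is to capture all of the text in the column, separated by timestamp.
I have been able to capture the first group of text with the pattern search below, but I would like to repeat the same approach, or a better one, to capture the entire text in the column.
My goal is to capture the text highlighted in the attached image.
I used the following pattern search and was able to capture the first group of matches.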
pattern=re.compile(r'(^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})(\s[-].*\n)(\D*[.])')
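Output: Job transactions done successfully through application transactions.
The text to search is:
1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done successfully through application transactions.
1997-09-01 11:46:22 - EFGH (Additional comments) Case set. Team to follow up with Support for resolution.
1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged. This does not impact application functionality. It was a one off job executed. We will need to discuss this with Team and check the logs to investigate the issue. This should be changed to 'low' severity because the job can be re-run at any time of the day.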
Answer 0 (score: 0)
Although this can be done with capture groups, I find it easier to do something like this:
import re

sample = """1997-09-01 12:30:14 - ABCD
Job transactions done successfully through application transactions
1997-09-01 09:15:00 - ABC
Acknowledged
This does not impact functionality.
"""

def date_match(s):
    """Return True if the beginning of this string matches a date and time."""
    return bool(re.match(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}", s))

def yield_matches(full_log):
    log = []  # lines of the entry currently being collected
    for line in full_log.split("\n"):  # for each line
        if date_match(line):  # this line starts with a date, so a new entry begins
            if len(log) > 0:  # an entry is already in progress...
                # drop its first line (the one with the date) and any empty lines
                lines = [l for l in log[1:] if l.strip()]
                yield "\n".join(lines)  # yield the finished entry
                log = []  # ... and reset the collector
        log.append(line)  # add the current line to the entry
    # yield the last entry (no date follows it to close it)
    yield "\n".join([l for l in log[1:] if l.strip()])

logs = list(yield_matches(sample))
for i, l in enumerate(logs, 1):
    print("Match {}:\n{}".format(i, l))
Output:
Match 1:
Job transactions done successfully through application transactions
Match 2:
Acknowledged
This does not impact functionality.
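For comparison, here is a rough sketch of what the capture-group route could look like on the same line-based layout. The names `TS`, `ENTRY` and `capture_entries` are made up for illustration, and the sketch assumes each entry starts on its own line with a timestamp, followed by " - " and a short label.

```python
import re

# Timestamp layout taken from the sample log: YYYY-MM-DD HH:MM:SS
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# Group 1: the timestamp at the start of a line.
# Group 2: everything up to the next line that starts with a timestamp,
#          or up to the end of the string.
ENTRY = re.compile(
    r"^(" + TS + r") - (.*?)(?=^" + TS + r"|\Z)",
    re.MULTILINE | re.DOTALL,
)

def capture_entries(full_log):
    """Return (timestamp, label, text) tuples, one per log entry."""
    entries = []
    for m in ENTRY.finditer(full_log):
        # The rest of the timestamp line is the label; the text follows it.
        label, _, text = m.group(2).partition("\n")
        entries.append((m.group(1), label.strip(), text.strip()))
    return entries
```

Without `re.MULTILINE`, `^` only matches at the very start of the string, which is one reason a pattern anchored with `^` stops after the first entry.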
Answer 1 (score: 0)
You can use a positive lookbehind and a positive lookahead to check for the timestamp format:
[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}
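A minimal sketch of how the two lookarounds might be wrapped around that pattern in Python. It assumes every timestamp is followed by " - ", as in the question's text; `BETWEEN` and `text_between_timestamps` are illustrative names. Python's `re` only accepts fixed-width lookbehinds, which works here because the timestamp always has the same length.

```python
import re

# The timestamp format from the answer above.
TS = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"

BETWEEN = re.compile(
    r"(?<=" + TS + r" - )"      # lookbehind: preceded by a timestamp and " - "
    r".*?"                      # the text itself, as short as possible
    r"(?=" + TS + r" - |\Z)",   # lookahead: next timestamp or end of string
    re.DOTALL,                  # let "." cross newline characters
)

def text_between_timestamps(text):
    """Return the chunks of text found between timestamps, whitespace-trimmed."""
    return [chunk.strip() for chunk in BETWEEN.findall(text)]
```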
Note that matching this format does not guarantee that the value is a valid date and time.
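If validity matters, one option is to hand each matched string to `datetime.strptime`, which raises `ValueError` for impossible values such as month 90. This is only a sketch, and `is_valid_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

def is_valid_timestamp(stamp):
    """Return True if stamp is a real date and time in YYYY-MM-DD HH:MM:SS form."""
    try:
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return True

# Both strings match the regex, but only the first is a real date.
print(is_valid_timestamp("1997-09-01 12:30:14"))
print(is_valid_timestamp("1997-90-01 09:15:00"))
```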
Another option in Python is to split on the timestamp format:
import re
s = """1997-09-01 12:30:14 - ABCD (Additional comments) Job transactions done successfully through application transactions. 1997-09-01 11:46:22 - EFGH (Additional comments) Case set. Team to follow up with Support for resolution. 1997-09-01 09:15:00 - ABC (Additional comments) Acknowledged. This does not impact application functionality. It was a one off job executed . We will need to discuss this with Team and check the logs to investigate the issue. This should be changed to 'low' severity because the job can be re-run at any time of the day."""
# list() is needed on Python 3, where filter() returns a lazy iterator
print(list(filter(None, re.split(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} - ', s))))
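The split above discards the timestamps. If they are needed alongside the text, a possible variation is to put the timestamp in a capture group, since `re.split` then returns the captured separators as well. The function name `split_entries` is an assumption for illustration.

```python
import re

TS = r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"

def split_entries(text):
    """Return (timestamp, text) pairs by splitting on captured timestamps."""
    # With a capture group, the result alternates:
    # [text before first timestamp, timestamp, text, timestamp, text, ...]
    parts = re.split(r"(" + TS + r") - ", text)
    return [(ts, body.strip()) for ts, body in zip(parts[1::2], parts[2::2])]
```

If the text sits in a pandas column, a function like this could probably be applied to each cell with `Series.apply`; the column name depends on your data.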