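I'm trying to collect specific information from very large log files, but I can't figure out how to get the behavior I need.
For reference, a sample log looks something like this: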
garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need
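I need to find 'keyword 1', grab the entire line that keyword 1 is on (back to the timestamp), and then capture all subsequent lines up to and including the entire line that 'keyword 2' is on (through the last bit of data).
So far I've tried a few things. I can't get decent results with the re methods (findall, match, search, etc.): I can't figure out how to grab the data before the match (even with a lookbehind), and, more importantly, I can't figure out how to make the capture stop at a phrase rather than at a single character.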
for match in re.findall('keyword1[keyword2]+|', showall.read()):
I also tried something like this:
start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)
No matter what I tried, it returned an empty list.
Finally, I tried something like this:
def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x

with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)
This last one captured everything from keyword 1 to EOF, including nearly 100,000 lines of garbage data.
Answer 0 (score: 1)
If you specify re.DOTALL and use the lazy quantifier .*? to match the start and the end, you can do this with a regex:
import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"
test_str = ("garbage I don't need - garbage I don't need\n"
            "timestamp - date - server info - 'keyword 1' - data\n"
            "more data more data more data more data\n"
            "more data more data more data more data\n"
            "more data more data 'keyword 2' - last bit of data\n"
            "garbage I don't need - garbage I don't need")

# DOTALL lets . cross newlines; MULTILINE lets $ match at the end of each line
matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)
for match in matches:
    print(match.group())  # your match is the whole group
Output:
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
You may need to strip('\n') the result, since the match starts with the newline that precedes the keyword 1 line.
...
You can try it here: https://regex101.com/r/HWIALZ/1 (the page also includes an explanation of the pattern). In short:
\n            newline
.*?           as few as possible anythings
(keyword 1)   literal text; the () are only needed if you want the group
.*?           as few as possible anythings
(keyword 2)   literal text; again, the () are not needed
.*?           as few as possible anythings
$             end of line
I included the () for clarity; if you don't evaluate the groups, remove them.
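One caveat worth checking against your own logs: the pattern starts at a \n, and under re.DOTALL the lazy .*? can cross newlines, so the match begins at the first newline in the text. If several unwanted lines precede the keyword 1 line, every line after the first one is captured too. Below is a minimal sketch of a variant that anchors on the start of the keyword 1 line instead. The sample text and the name anchored are made up for illustration, and the variant is a suggestion rather than part of the pattern above.

```python
import re

# Hypothetical sample with TWO unwanted lines before the block.
sample = (
    "garbage A\n"
    "garbage B\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage C"
)

# ^        start of a line (re.MULTILINE)
# [^\n]*   the part of that line before the keyword, never crossing a newline
# .*?      as little as possible, across lines (re.DOTALL)
# [^\n]*   the remainder of the 'keyword 2' line
anchored = re.compile(r"^[^\n]*keyword 1.*?keyword 2[^\n]*",
                      re.DOTALL | re.MULTILINE)

blocks = [m.group() for m in anchored.finditer(sample)]
for block in blocks:
    print(block)
```

With this variant the match starts at the timestamp, so no strip('\n') is needed. It still reads the whole text into memory, like any regex over file.read().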
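Answer 1 (score: 1)
The following is fast for files of any size. It extracted nearly 2 million lines from a 2.5-million-line log file in 3 seconds. The extracted section was at the end of the file.
If your file might not fit in available memory, I don't recommend using a list, a regex, or other in-memory techniques.
Test text file startstop_text: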
line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output
Code:
from itertools import dropwhile

def keepuntil(contains_end_keyword, lines):
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break

with open('startstop_text', 'r') as f:
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())
>>> python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
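If the log may contain more than one keyword1 ... keyword2 block, the same streaming idea can be written as a small state machine that never buffers more than one line. This is only a sketch: the function name iter_block_lines and the in-memory sample are hypothetical, and it assumes the keywords are plain substrings.

```python
import io

def iter_block_lines(lines, start_kw, end_kw):
    """Lazily yield (block_number, line) for lines inside start_kw..end_kw blocks."""
    block_no = 0
    capturing = False
    for line in lines:
        if not capturing and start_kw in line:
            capturing = True          # the start line itself is part of the block
            block_no += 1
        if capturing:
            yield block_no, line.rstrip("\n")
            if end_kw in line:
                capturing = False     # the end line is included, then capture stops

# Hypothetical sample standing in for an open log file.
sample = io.StringIO(
    "line 1 noise\n"
    "line 2 keyword1\n"
    "line 3 data\n"
    "line 4 keyword2\n"
    "line 5 noise\n"
    "line 6 keyword1 and keyword2 together\n"
    "line 7 noise\n"
)

result = list(iter_block_lines(sample, "keyword1", "keyword2"))
for block_no, line in result:
    print(block_no, line)
```

For a real file, pass the open file object instead of the StringIO and consume the generator line by line (for example, writing each line to an output file) rather than calling list() on it.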
Answer 2 (score: -1)
None of the other responses were of any use to me, but I was able to figure it out using a regex.
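# Raw string so that \s and \S reach the regex engine unchanged
for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
    print(match)  # each match is one full block, first line through last line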
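To see how a pattern of this shape behaves, here is a small sketch on made-up text. [\s\S] matches any character, newlines included, so it does the job of . under re.DOTALL, while the plain .* at each end stays on a single line and therefore picks up the rest of the first and last lines. The log text is a placeholder and the keywords are stand-ins, as in the answer.

```python
import re

# Placeholder log text; the keywords are stand-ins.
log_text = (
    "noise before\n"
    "ts - keyword1 - start\n"
    "data\n"
    "keyword2: middle\n"
    "data\n"
    "uptime keyword3 end of block\n"
    "noise after\n"
)

# .*        the rest of the line on either side (does not cross newlines)
# [\s\S]*?  as little as possible of anything, newlines included
pattern = re.compile(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*")

matches = pattern.findall(log_text)
for match in matches:
    print(match)
```

Because the pattern has no capture groups, findall returns the full matched text. Note that log_file.read() loads the entire file, so this approach suits files that fit comfortably in memory.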