I'm trying to use a regular expression to extract part of a log (a txt file), but I need some help. The log looks like this:
Tue Feb 24 17:51:10.835 SRV02 NOTICE Event Loop - noop
Tue Feb 24 17:51:10.835 SRV02 NOTICE Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 0]: < 0.001: 728941854
Tue Feb 24 17:51:10.835 SRV02 NOTICE Event Loop - noop: samples: 728941854; avg: 0.00; min: 0.00; max: 0.00
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time
Tue Feb 24 17:51:10.835 SRV02 NOTICE Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 4]: < 0.016: 3
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 5]: < 0.032: 23
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 6]: < 0.064: 14
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 7]: < 0.128: 4
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 8]: < 0.256: 6
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 9]: < 0.512: 1
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[10]: < 1.024: 2
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time: samples: 53; avg: 0.08; min: 0.01; max: 0.67
Tue Feb 24 17:51:10.835 SRV02 NOTICE Client Hugepage Memory: 649/4096 MB
Tue Feb 24 17:51:10.836 SRV02 NOTICE DQM: Num R: 0 RD: 0 ED: 0 W: 0 WH: 0 Q: 0 D: 0 DF: 0
Tue Feb 24 17:51:10.836 SRV02 NOTICE Num G: 0 M: 0 S: 0 D: 0 U: 0 R: 0 N: 0
Tue Feb 24 17:51:10.836 SRV02 NOTICE num_template_allocs = 4
Tue Feb 24 17:51:10.836 SRV02 NOTICE num_template_frees = 0
Tue Feb 24 17:51:10.836 SRV02 NOTICE num_internal_book_allocs = 24
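I need the information under "Exponential histogram". In this example, that means detecting the string "Exponential histogram" and collecting every "hist[..." line so I can import it into a spreadsheet. I also need this information: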
samples: XX; avg: X.XX; min: X.XX; max: X.XX
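So for the example above I need to extract and rearrange the data as follows, with "Event Loop - noop" and "Data Quality Monitor Thread Processing Time" repeated on every row so that each histogram can be identified: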
Event Loop - noop;hist[ 0];0.001;728941854
Event Loop - noop;samples;728941854;avg;0.00;min;0.00;max;0.00
Data Quality Monitor Thread Processing Time;hist[ 4];0.016;3
Data Quality Monitor Thread Processing Time;hist[ 5];0.032;23
Data Quality Monitor Thread Processing Time;hist[ 6];0.064;14
(...)
Data Quality Monitor Thread Processing Time;hist[ 10];1.024;2
Data Quality Monitor Thread Processing Time;samples;53;avg;0.08;min;0.01;max;0.67
Can anyone help me with how to do this? Thanks!
Answer 0 (score: 1)
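In your example output you have data that is not present in the sample input. Specifically, your output contains more "Data Quality Monitor Thread Processing Time" strings than the input does. It looks like you want each row to keep the most recent header line?
Either way, I think it is easier to use a few separate regular expressions to extract the data than to make a single expression capture everything: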
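import re

# log_text holds the full contents of the log file as one string
# \s* (rather than \s) lets two-digit bins such as hist[10] match as well
hists = re.findall(r'(hist\[\s*\d+\]).*?(\d+\.\d+).*?(\d+)', log_text)
sample_avg_etc = re.findall(r'(samples): (\d+); (avg): (\d+\.\d+); (min): (\d+\.\d+); (max): (\d+\.\d+)', log_text)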
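As a quick check of the two patterns, here is a minimal sketch that runs them on a few lines copied from the log in the question. The name `log_text` is only illustrative, and the hist pattern uses `\s*` on the assumption that two-digit bins such as `hist[10]` should match too.

```python
import re

# A few lines copied from the question's log; log_text is an illustrative name.
log_text = """\
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time
Tue Feb 24 17:51:10.835 SRV02 NOTICE Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 9]: < 0.512: 1
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[10]: < 1.024: 2
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time: samples: 53; avg: 0.08; min: 0.01; max: 0.67
"""

# Each match is a tuple: (bin label, upper bound, count)
hists = re.findall(r'(hist\[\s*\d+\]).*?(\d+\.\d+).*?(\d+)', log_text)
# Each match is a tuple of alternating keyword and value
stats = re.findall(
    r'(samples): (\d+); (avg): (\d+\.\d+); (min): (\d+\.\d+); (max): (\d+\.\d+)',
    log_text,
)

print(hists)
print(stats)
```

Note that the resulting tuples say nothing about which header each hist or samples line belongs to.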
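If you need to keep the local headers shown in your example output, I don't think a regular expression alone is what you want. Instead, just write a small parser to extract the data.
You can start by stripping the leading "Tue Feb 24 17:51:10.835 SRV02 NOTICE" from each line, then go through the data line by line while keeping track of the last header seen. See the comments; the code below returns what you listed above: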
import re

# Length of the fixed-width logging prefix that starts every line
PREFIX_LEN = len("Tue Feb 24 17:51:10.835 SRV02 NOTICE ")

def parse(data):
    lines = data.split('\n')  # get the lines by splitting on the newline char
    # remove the logging info, then any indentation, so the checks below
    # work whether or not the log indents the histogram lines
    lines = [line[PREFIX_LEN:].strip() for line in lines]
    out = []
    header = ''
    for line in lines:
        if not line or line == 'Exponential histogram:':
            continue  # neither a header nor data
        if line.startswith('hist'):
            # outsource the specific extracting to a function for ease of readability
            out.append(header + ";" + extract_hist_data(line))
        elif all(i in line for i in ("samples", "avg", "min", "max")):
            # the line contains all these keywords, so it is a samples line
            out.append(header + ";" + extract_stat_data(line))
        else:  # treat as a header
            header = line
    return '\n'.join(out)

def extract_hist_data(line):
    # captures the bin label, its upper bound and its count
    data = re.findall(r'(hist\[\s*\d+\]).*?(\d+\.\d+).*?(\d+)', line)
    return ';'.join(data[0]) if data else ""

def extract_stat_data(line):
    # captures each keyword followed by its value
    data = re.findall(r'(samples).*?(\d+).*?(avg).*?(\d+\.\d+).*?(min).*?(\d+\.\d+).*?(max).*?(\d+\.\d+)', line)
    return ';'.join(data[0]) if data else ""

def parse_log_file(log_file_path):
    with open(log_file_path, 'r') as f:
        return parse(f.read())

print(parse_log_file('test.log'))  # Python 3 print call
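Since the goal is a spreadsheet import, here is a hedged variation on the same line-by-line idea that emits semicolon-delimited rows through the standard `csv` module. It is only a sketch: the names `rows_from_log` and `to_csv` are hypothetical, and the `PREFIX` pattern assumes every line starts with weekday, month, day, time, host and level, as in the sample. Matching the prefix with a pattern instead of slicing a fixed number of characters should tolerate small differences in prefix width.

```python
import csv
import io
import re

# Assumed prefix layout: weekday, month, day, time, host, level (as in the sample)
PREFIX = re.compile(r'^\w{3} \w{3} +\d+ [\d:.]+ \S+ \S+ ?')
HIST = re.compile(r'(hist\[\s*\d+\]):\s*<\s*(\d+\.\d+):\s*(\d+)')
STAT = re.compile(r'samples: (\d+); avg: (\d+\.\d+); min: (\d+\.\d+); max: (\d+\.\d+)')

def rows_from_log(text):
    """Yield one list of fields per hist or samples line, led by its header."""
    header = ''
    for raw in text.splitlines():
        line = PREFIX.sub('', raw).strip()
        if not line or line == 'Exponential histogram:':
            continue  # neither a header nor data
        hist = HIST.search(line)
        stat = STAT.search(line)
        if line.startswith('hist') and hist:
            yield [header, *hist.groups()]
        elif stat:
            samples, avg, low, high = stat.groups()
            yield [header, 'samples', samples, 'avg', avg, 'min', low, 'max', high]
        else:
            header = line  # remember the most recent header

def to_csv(text):
    """Return the extracted rows as semicolon-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=';', lineterminator='\n')
    writer.writerows(rows_from_log(text))
    return buf.getvalue()

sample = """\
Tue Feb 24 17:51:10.835 SRV02 NOTICE Event Loop - noop
Tue Feb 24 17:51:10.835 SRV02 NOTICE Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02 NOTICE hist[ 0]: < 0.001: 728941854
Tue Feb 24 17:51:10.835 SRV02 NOTICE Event Loop - noop: samples: 728941854; avg: 0.00; min: 0.00; max: 0.00
"""

print(to_csv(sample), end='')
```

Writing the result of `to_csv` to a file with a `.csv` extension should give something most spreadsheet programs can open once the delimiter is set to a semicolon.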