I have a .txt file containing request logs in the following format:
time_namelookup: 0,121668
time_connect: 0,460643
time_pretransfer: 0,460755
time_redirect: 0,000000
time_starttransfer: 0,811697
time_total: 0,811813
-------------
time_namelookup: 0,121665
time_connect: 0,460643
time_pretransfer: 0,460355
time_redirect: 0,000000
time_starttransfer: 0,813697
time_total: 0,811853
-------------
time_namelookup: 0,121558
time_connect: 0,463243
time_pretransfer: 0,460755
time_redirect: 0,000000
time_starttransfer: 0,911697
time_total: 0,811413
I want to build a list of values for each category, so I thought a regular expression might make sense here.
import re
'''
In this example, I save only the 'time_namelookup' parameter.
The same logic is adapted for the other parameters.
'''
namelookup = []
with open('shaghai_if_config_test.txt', 'r') as fh:
    for line in fh.readlines():
        number_match = re.match('([+-]?([0-9]*[,])?[0-9]+)', line)
        namelookup_match = re.match('^time_namelookup:', line)
        if namelookup_match and number_match:
            num = number_match.group(0)
            namelookup.append(num)
            continue
I find this logic overly complicated, because I have to run two regex matches per line. In addition, the number_match pattern does not match the line, whereas ^time_namelookup: ([+-]?([0-9]*[,])?[0-9]+) works fine.
I am looking for experienced advice on the procedure above. Any suggestions are appreciated.
Answer 0 (score: 1)
My guess is that you have already designed a good expression, which we could modify slightly to:
(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)
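As background on why the original number_match never matched: re.match only attempts a match at the very start of the string, and every log line starts with the letter "t" rather than a digit. The sketch below illustrates this on one line taken from the log. The variable names are illustrative only, and the number pattern is the question's pattern with a non-capturing group.

```python
import re

line = "time_namelookup: 0,121668\n"
number_pattern = r"[+-]?(?:[0-9]*,)?[0-9]+"

# re.match only tries position 0, where the line has "t", so this is None.
at_start = re.match(number_pattern, line)

# Putting the label and the number into one pattern needs a single match call;
# group(1) is the captured number.
combined = re.match(r"time_namelookup:\s*(" + number_pattern + r")", line)

print(at_start)
print(combined.group(1))
```

This is also why the combined ^time_namelookup: ... pattern mentioned in the question works: the label consumes the start of the line, so the number part is tried at the right position.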
Test with re.findall:
import re
regex = r"(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)"
test_str = ("time_namelookup: 0,121668 \n"
"time_connect: 0,460643 \n"
"time_pretransfer: 0,460755 \n"
"time_redirect: 0,000000 \n"
"time_starttransfer: 0,811697 \n"
"time_total: 0,811813 \n")
print(re.findall(regex, test_str))
[('time_namelookup', '0,121668'), ('time_connect', '0,460643'), ('time_pretransfer', '0,460755'), ('time_redirect', '0,000000'), ('time_starttransfer', '0,811697'), ('time_total', '0,811813')]
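Since the question asks for one list of values per category, the (name, value) pairs returned by re.findall can be folded into a dictionary of lists. The following is a minimal sketch under stated assumptions: sample_log is a shortened, hypothetical stand-in for the file contents, collect_values is an illustrative helper name, and the decimal comma is converted to a point so that float() accepts it.

```python
import re
from collections import defaultdict

# Same pattern as above: group 1 is the field name, group 2 the number.
regex = r"(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)"

# Hypothetical sample standing in for the log file contents.
sample_log = (
    "time_namelookup: 0,121668\n"
    "time_total: 0,811813\n"
    "-------------\n"
    "time_namelookup: 0,121665\n"
    "time_total: 0,811853\n"
)

def collect_values(text):
    """Return {field name: [float, ...]} in order of appearance."""
    values = defaultdict(list)
    for name, number in re.findall(regex, text):
        # The log uses a decimal comma, which float() does not accept.
        values[name].append(float(number.replace(",", ".")))
    return dict(values)

values_by_name = collect_values(sample_log)
print(values_by_name["time_namelookup"])
```

For a real file, the text argument would be the result of reading the whole file into a string.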
Test with re.finditer:
import re
regex = r"(time_(?:namelookup|connect|pretransfer|redirect|starttransfer|total))\s*:\s*([+-]?(?:\d*,)?\d+)"
test_str = ("time_namelookup: 0,121668 \n"
"time_connect: 0,460643 \n"
"time_pretransfer: 0,460755 \n"
"time_redirect: 0,000000 \n"
"time_starttransfer: 0,811697 \n"
"time_total: 0,811813 \n"
"-------------\n"
"time_namelookup: 0,121665 \n"
"time_connect: 0,460643 \n"
"time_pretransfer: 0,460355 \n"
"time_redirect: 0,000000 \n"
"time_starttransfer: 0,813697 \n"
"time_total: 0,811853 \n"
"-------------\n"
"time_namelookup: 0,121558 \n"
"time_connect: 0,463243 \n"
"time_pretransfer: 0,460755 \n"
"time_redirect: 0,000000 \n"
"time_starttransfer: 0,911697 \n"
"time_total: 0,811413 ")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    # Report the whole match and its position in the string
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Groups are numbered from 1: group 1 is the name, group 2 the value
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of this demo.
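Answer 1 (score: 1)
Looping over a list of the labels that appear to the left of each captured value makes this easier:
import re
lst = ['time_namelookup', 'time_connect', 'time_pretransfer', 'time_redirect', 'time_starttransfer', 'time_total']
result = []
for x in lst:
    # One list of captured values per label, in the order of lst
    result.append(re.findall(f'{x}: (.*)', s))
print(result)
where s is the contents of your text file.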
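A self-contained variant of the same loop might look like the sketch below. It makes a few assumptions that are not part of the answer: s is a short hypothetical sample instead of the real file, the result is keyed by label rather than stored as a list of lists, and \S+ replaces .* so that trailing spaces are not captured.

```python
import re

# Hypothetical sample; in practice s would come from open(path).read().
s = (
    "time_namelookup: 0,121668 \n"
    "time_connect: 0,460643 \n"
    "-------------\n"
    "time_namelookup: 0,121665 \n"
    "time_connect: 0,460643 \n"
)

names = ["time_namelookup", "time_connect"]

# One findall per name; \S+ stops before any trailing whitespace.
result = {
    name: re.findall(rf"^{re.escape(name)}:\s*(\S+)", s, flags=re.MULTILINE)
    for name in names
}
print(result)
```

Anchoring with ^ under re.MULTILINE keeps a label from matching in the middle of another line, and re.escape guards against labels that contain regex metacharacters.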
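Answer 2 (score: 1)
You can also apply itertools.groupby and str.split for a non-regex solution:
from itertools import groupby
# Read the lines without their newline; the file is closed when the block ends
with open('filename.txt') as fh:
    data = [i.strip('\n') for i in fh]
# Consecutive lines starting with 'time' form one request; separators split the groups
new_data = [[a, list(b)] for a, b in groupby(data, key=lambda x:x.startswith('time'))]
results = [dict(i.split(': ') for i in b) for a, b in new_data if a]
Output: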
[{'time_namelookup': '0,121668 ', 'time_connect': '0,460643 ', 'time_pretransfer': '0,460755 ', 'time_redirect': '0,000000 ', 'time_starttransfer': '0,811697 ', 'time_total': '0,811813 '},
{'time_namelookup': '0,121665 ', 'time_connect': '0,460643 ', 'time_pretransfer': '0,460355 ', 'time_redirect': '0,000000 ', 'time_starttransfer': '0,813697 ', 'time_total': '0,811853 '},
{'time_namelookup': '0,121558 ', 'time_connect': '0,463243 ', 'time_pretransfer': '0,460755 ', 'time_redirect': '0,000000 ', 'time_starttransfer': '0,911697 ', 'time_total': '0,811413 '}]
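The result above is one dictionary per request. To get one list per field, as the question asks, the records can be transposed. The sketch below assumes a short in-memory list of lines in place of the file; lines, records, and columns are illustrative names.

```python
from itertools import groupby

# Hypothetical in-memory lines standing in for the file.
lines = [
    "time_namelookup: 0,121668",
    "time_total: 0,811813",
    "-------------",
    "time_namelookup: 0,121665",
    "time_total: 0,811853",
]

# Consecutive "time..." lines form one request; separators split the groups.
records = [
    dict(line.split(": ", 1) for line in group)
    for is_entry, group in groupby(lines, key=lambda x: x.startswith("time"))
    if is_entry
]

# Turn the list of per-request dicts into one list per field.
columns = {}
for record in records:
    for name, value in record.items():
        columns.setdefault(name, []).append(value.strip())

print(columns)
```

Calling strip() on each value also removes the trailing spaces visible in the output above.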
Answer 3 (score: 1)
If the format is that simple, here is another idea: read the file with a CSV parser, using the colon as the delimiter. Example:
import csv
import itertools
from pprint import pprint as print
file = 'log.txt'
with open(file) as fp:
    reader = csv.reader(fp, delimiter=':')
    # filter out delimiter lines
    rows = [r for r in reader if len(r) == 2]
# group pairs by first element to a dict of lists
grouped = {k: [x[1] for x in v] for k, v
           in itertools.groupby(sorted(rows), key=lambda x: x[0])}
print(grouped)
This gives you:
{'time_connect': [' 0.460643 ', ' 0.460643 ', ' 0.463243 '],
'time_namelookup': [' 0.121558 ', ' 0.121665 ', ' 0.121668 '],
'time_pretransfer': [' 0.460355 ', ' 0.460755 ', ' 0.460755 '],
'time_redirect': [' 0.000000 ', ' 0.000000 ', ' 0.000000 '],
'time_starttransfer': [' 0.811697 ', ' 0.813697 ', ' 0.911697 '],
'time_total': [' 0.811413 ', ' 0.811813 ', ' 0.811853 ']}
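If you need further processing, do it inside the dict comprehension, for example to parse the numbers (the replace call converts a decimal comma such as 0,460643 into a point so that float() accepts it):
grouped = {k: [float(x[1].strip().replace(',', '.')) for x in v] for k, v
           in itertools.groupby(sorted(rows), key=lambda x: x[0])}
Output: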
{'time_connect': [0.460643, 0.460643, 0.463243],
'time_namelookup': [0.121558, 0.121665, 0.121668],
'time_pretransfer': [0.460355, 0.460755, 0.460755],
'time_redirect': [0.0, 0.0, 0.0],
'time_starttransfer': [0.811697, 0.813697, 0.911697],
'time_total': [0.811413, 0.811813, 0.811853]}
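Note that sorted(rows) orders the rows by label and then by value, so the lists above are no longer in request order. The sketch below shows one way to keep that order, under stated assumptions: the sample text is a shortened, hypothetical log with decimal commas read through io.StringIO, and the sort uses only the label as key, which works because Python's sort is stable.

```python
import csv
import io
import itertools

# Hypothetical sample with decimal commas, as in the question's log.
sample = io.StringIO(
    "time_namelookup: 0,121668 \n"
    "time_total: 0,811813 \n"
    "-------------\n"
    "time_namelookup: 0,121665 \n"
    "time_total: 0,811853 \n"
)

reader = csv.reader(sample, delimiter=":")
# Separator lines produce a single-element row and are dropped here.
rows = [r for r in reader if len(r) == 2]

# Sorting by label only keeps the original request order within each label.
by_label = sorted(rows, key=lambda r: r[0])

grouped = {
    name: [float(row[1].strip().replace(",", ".")) for row in group]
    for name, group in itertools.groupby(by_label, key=lambda r: r[0])
}
print(grouped)
```

With this sample, the time_namelookup list keeps 0.121668 before 0.121665, matching the order of the requests in the log.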
pandas
If you happen to have pandas, you can use it to read the log as CSV, which saves you the trouble of parsing and grouping the data. Example:
import pandas as pd
df = pd.read_csv('log.txt', delimiter=':', header=None, names=['Name', 'Num']).dropna().reset_index(drop=True)
print(df)
This prints the data parsed and ready to use. The floats shown assume values written with a decimal point; for decimal commas as in the question, the decimal=',' argument of read_csv is one option.
                  Name       Num
0      time_namelookup  0.121668
1         time_connect  0.460643
2     time_pretransfer  0.460755
3        time_redirect  0.000000
4   time_starttransfer  0.811697
5           time_total  0.811813
6      time_namelookup  0.121665
7         time_connect  0.460643
8     time_pretransfer  0.460355
9        time_redirect  0.000000
10     time_namelookup  0.121558
11        time_connect  0.463243
12    time_pretransfer  0.460755
13       time_redirect  0.000000
14  time_starttransfer  0.911697
15          time_total  0.811413
From here you can process the data however you like, for example reshaping the data frame for a more structured view:
df['chunk'] = df.index // df.Name.unique().size
print(df.pivot(values='Num', columns='Name', index='chunk'))
# Output:
Name   time_connect  time_namelookup  time_pretransfer  time_redirect  time_starttransfer  time_total
chunk
0          0.460643         0.121668          0.460755            0.0            0.811697    0.811813
1          0.460643         0.121665          0.460355            0.0            0.813697    0.811853
2          0.463243         0.121558          0.460755            0.0            0.911697    0.811413
Computing statistics for a selected timing:
print(df[df.Name == 'time_total'].describe())
# Output:
Num
count 3.000000
mean 0.811693
std 0.000243
min 0.811413
25% 0.811613
50% 0.811813
75% 0.811833
max 0.811853
and so on.
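If pandas is not available, the same summary figures can be approximated with the standard library. This is a minimal sketch that assumes the three time_total values have already been parsed into floats; time_total, mean_total, and std_total are illustrative names. statistics.stdev computes the sample standard deviation, which is also what describe() reports as std.

```python
import statistics

# time_total values from the three requests in the log above.
time_total = [0.811813, 0.811853, 0.811413]

mean_total = statistics.mean(time_total)
std_total = statistics.stdev(time_total)  # sample standard deviation (n - 1)

print(round(mean_total, 6))
print(round(std_total, 6))
print(min(time_total), max(time_total))
```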