I'm trying to turn a log file (specifically, one from a Gradle build) that looks like this:
21:51:38.991 [DEBUG] [TestEventLogger] cha.LoginTest4 STARTED
21:51:39.054 [DEBUG] [TestEventLogger] cha.LoginTest2 STARTED
21:51:40.068 [DEBUG] [TestEventLogger] cha.LoginTest4 PASSED
21:51:40.101 [DEBUG] [TestEventLogger] cha.LoginTest2 PASSED
21:51:40.366 [DEBUG] [TestEventLogger] cha.LoginTest1 STARTED
21:51:40.413 [DEBUG] [TestEventLogger] cha.LoginTest3 STARTED
21:51:50.435 [DEBUG] [TestEventLogger] cha.LoginTest1 PASSED
21:51:50.463 [DEBUG] [TestEventLogger] cha.LoginTest3 PASSED
21:51:50.484 [DEBUG] [TestEventLogger] Gradle Test Run :test PASSED
21:51:38.622 [DEBUG] [TestEventLogger] Gradle Test Run :test STARTED
into a chart that shows a timeline of the events. Something like this:
n | =======
a | ===
m | ==
e | =======
|______________
time
So far, I've parsed the log and put the relevant "events" into a Pandas dataframe (sorted by timestamp).
log events parsed, sorted and ungrouped:
timestamp name
0 1900-01-01 21:51:38.622 test
0 1900-01-01 21:51:38.991 cha.LoginTest4
0 1900-01-01 21:51:39.054 cha.LoginTest2
0 1900-01-01 21:51:40.068 cha.LoginTest4
0 1900-01-01 21:51:40.101 cha.LoginTest2
0 1900-01-01 21:51:40.366 cha.LoginTest1
0 1900-01-01 21:51:40.413 cha.LoginTest3
0 1900-01-01 21:51:50.435 cha.LoginTest1
0 1900-01-01 21:51:50.463 cha.LoginTest3
0 1900-01-01 21:51:50.484 test
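For reference, the parsing step that yields rows like the ones above could be sketched as follows. This is a minimal illustration using only the standard library; the combined pattern `EVENT_RE` and the helper `parse_events` are hypothetical names, and the pattern is inferred from the sample lines rather than from any Gradle specification.

```python
import re
from datetime import datetime

# Pattern inferred from the sample log lines (an assumption, not a Gradle spec):
# timestamp, [DEBUG], [TestEventLogger], optional "Gradle Test Run :" prefix,
# the test/task name, and the event type.
EVENT_RE = re.compile(
    r'^(\S+) \[DEBUG\] \[TestEventLogger\] '
    r'(?:Gradle Test Run :)?(\S+) (STARTED|PASSED|FAILED)$'
)

def parse_events(lines):
    """Return (timestamp, name, event_type) tuples sorted by timestamp."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            stamp, name, kind = match.groups()
            # No date in the log, so strptime falls back to 1900-01-01.
            events.append((datetime.strptime(stamp, "%H:%M:%S.%f"), name, kind))
    # The log is not guaranteed to be in order, so sort explicitly.
    return sorted(events, key=lambda event: event[0])

sample = [
    "21:51:38.991 [DEBUG] [TestEventLogger] cha.LoginTest4 STARTED",
    "21:51:40.068 [DEBUG] [TestEventLogger] cha.LoginTest4 PASSED",
    "21:51:50.484 [DEBUG] [TestEventLogger] Gradle Test Run :test PASSED",
    "21:51:38.622 [DEBUG] [TestEventLogger] Gradle Test Run :test STARTED",
]
events = parse_events(sample)
```

The resulting list of tuples can be handed to `pd.DataFrame(events, columns=[...])` if a dataframe is wanted.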
Because I need the start and end time for each "name", I do a groupby. The groups I get look like this:
group timestamp name
0 1900-01-01 21:51:38.991 cha.LoginTest4
0 1900-01-01 21:51:40.068 cha.LoginTest4
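There are always two rows: the first row is the start time and the last row is the end time. I was able to use hlines to show a timeline for each group. However, I'd like to get all the groups into the same plot so I can see when they start and end relative to each other. I'd still like to use groupby, since it gives me the start/end times along with the "name" in a few lines of code.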
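As a side note, the start and end time per name can also be pulled out of the groupby in one step with a named aggregation. The sketch below assumes a small hypothetical frame with the same two columns as above; the column names `start`, `end`, and `duration_s` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the parsed frame: two rows per name.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["21:51:38.622", "21:51:38.991", "21:51:40.068", "21:51:50.484"],
        format="%H:%M:%S.%f",
    ),
    "name": ["test", "cha.LoginTest4", "cha.LoginTest4", "test"],
})

# One row per name, with the earliest and latest timestamp as columns.
# min/max do not depend on the rows being sorted beforehand.
spans = df.groupby("name", sort=False)["timestamp"].agg(start="min", end="max")
spans["duration_s"] = (spans["end"] - spans["start"]).dt.total_seconds()
print(spans)
```

Using min/max instead of positional `iloc[0]`/`iloc[1]` trades the "exactly two rows" assumption for a weaker one: each name has at least one event.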
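The only thing I've managed to do without errors is show a separate plot for each group, not one plot with all of them. Here is what I did for the per-group plots: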
for name, group in df.groupby('name', sort=False):
    group.amin = group['timestamp'].iloc[0] # assume sorted order
    group.amax = group['timestamp'].iloc[1]
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax = ax.xaxis_date()
    ax = plt.hlines(group.index, dt.date2num(group.amin), dt.date2num(group.amax))
    plt.show()
Solved. Full source:
import re
import warnings
from random import random

import matplotlib.dates as dt
import matplotlib.pyplot as plt
import pandas as pd

warnings.simplefilter(action='ignore', category=FutureWarning) # https://stackoverflow.com/a/46721064

'''
The log contents are not guaranteed to be in order. Multiple processes are dumping contents into a single file.
Contents from a single process will be in order.
'''

def main():
    log_file_path = "gradle-4.2.test.debug.log"
    # regex to get test and task log events
    test_re = re.compile(r'^(\S+) \[DEBUG\] \[TestEventLogger\] (\S+[^:>]) (STARTED|PASSED|FAILED)$')
    task_re = re.compile(r'^(\S+) \[DEBUG\] \[TestEventLogger\] Gradle Test Run [:](\S+) (STARTED|PASSED|FAILED)$')
    rows = [] # (timestamp, name, type) tuples; DataFrame.append was removed in pandas 2.0
    with open(log_file_path, "r") as file: # the with block closes the file on exit
        for line in file:
            test_match = test_re.findall(line)
            if test_match:
                rows.extend(test_match)
            else:
                task_match = task_re.findall(line)
                if task_match:
                    rows.extend(task_match)
    df = pd.DataFrame(rows, columns=['timestamp', 'name', 'type'])
    df.drop('type', axis=1, inplace=True) # don't need this col
    df['timestamp'] = pd.to_datetime(df.timestamp, format="%H:%M:%S.%f") # pandas datetime
    df = df.sort_values('timestamp') # sort by pandas datetime
    print("log events parsed, sorted and ungrouped:\n", df)

    fig, ax = plt.subplots()
    ax.xaxis_date()
    # Customize the major grid
    ax.minorticks_on()
    ax.grid(which='major', linestyle='-', linewidth=0.2, color='gray')

    i = 0 # y-coord will be loop iteration
    # Groupby name. Because the df was previously sorted, each group is in sorted order (first event, second event).
    # Give each group an hline.
    for name, group in df.groupby('name', sort=False):
        i += 1
        assert group['timestamp'].size == 2 # make sure we have a start & end time for each test/task
        start = group['timestamp'].iloc[0] # assume sorted order
        end = group['timestamp'].iloc[1]
        assert start < end # make sure start/end times are in order
        if '.' in name: # assume '.' indicates a JUnit test, not a task
            color = [(random(), random(), random())]
            linestyle = 'solid'
            ax.text(start, (i + 0.05), name, color='blue') # add name at x, y+.05 above the hline
        else: # a task.
            color = 'black'
            linestyle = 'dashed'
            ax.text(start, (i + 0.05), name + ' (Task)', color='red') # add name at x, y+.05 above the hline
        ax.hlines(i, dt.date2num(start), dt.date2num(end), linewidth=6, color=color, linestyle=linestyle)

    # Turn off y ticks. These are just execution order (numbers won't make sense).
    plt.setp(ax.get_yticklabels(), visible=False)
    ax.yaxis.set_tick_params(size=0)
    ax.yaxis.tick_left()
    plt.title('Timeline of Gradle Task and Test Execution')
    plt.xlabel('Time')
    plt.ylabel('Execution Order')
    plt.show()
    # plt.savefig('myfig')

if __name__ == '__main__':
    main()
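So how do I combine this grouped, timestamped dataframe into a single chart that shows the start/end timelines?
It felt like I ran into trouble with regexes, dataframes, datetimes and everything else along the way, but I think I ended up with a nice, clean solution.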
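Answer 0: (score: 0)
I can't test this right now, sorry, but this (or something close to it) should help: create one figure before the plotting loop, then plot each group's data onto that single axes.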
fig, ax = plt.subplots()
ax.xaxis_date()
for name, group in df.groupby('name', sort=False):
    start = group['timestamp'].iloc[0] # assume sorted order
    end = group['timestamp'].iloc[1]
    ax.hlines(group.index, dt.date2num(start), dt.date2num(end))
plt.show()
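One thing to watch for with the snippet above: it uses `group.index` as the y position, and in the frame printed in the question every row has index 0, so every line would land on the same y value. A possible variation, sketched here with a hypothetical four-row frame, takes the y position from `enumerate` instead. The `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame with a non-unique index, like the one printed in the question.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["21:51:38.991", "21:51:39.054", "21:51:40.068", "21:51:40.101"],
            format="%H:%M:%S.%f",
        ),
        "name": ["cha.LoginTest4", "cha.LoginTest2",
                 "cha.LoginTest4", "cha.LoginTest2"],
    },
    index=[0, 0, 0, 0],
)

fig, ax = plt.subplots()  # one figure and one axes, created before the loop
ax.xaxis_date()
names = []
for y, (name, group) in enumerate(df.groupby("name", sort=False)):
    start = group["timestamp"].iloc[0]   # assumes rows are sorted by timestamp
    end = group["timestamp"].iloc[-1]
    # One horizontal line per group, each at its own y position.
    ax.hlines(y, mdates.date2num(start), mdates.date2num(end), linewidth=6)
    names.append(name)

# Label each line with its group name instead of a bare number.
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
# Call plt.show() or fig.savefig(...) here to view the result.
```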
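Answer 1: (score: 0)
My first association with this problem was to use plt.barh, but I have to admit I struggled with the datetime handling for a while until the result looked the way I expected.
Anyway, here is what came out of that idea:
Assume the following dataframe as the starting point: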
df
Out:
timestamp name
0 21:51:38.622 test
1 21:51:38.991 cha.LoginTest4
2 21:51:39.054 cha.LoginTest2
3 21:51:40.068 cha.LoginTest4
4 21:51:40.101 cha.LoginTest2
5 21:51:40.366 cha.LoginTest1
6 21:51:40.413 cha.LoginTest3
7 21:51:50.435 cha.LoginTest1
8 21:51:50.463 cha.LoginTest3
9 21:51:50.484 test
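First, I group by name and create a new dataframe that holds the start and stop times as matplotlib.dates numbers (the duration is derived from them afterwards):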
grpd = df.groupby('name')
plot_data = pd.DataFrame({'start': dt.date2num(pd.to_datetime(grpd.min().timestamp)), 'stop': dt.date2num(pd.to_datetime(grpd.max().timestamp))}, grpd.min().index)
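Then I subtract the first start time so that the axis starts from zero (while still adding 1, because that is where matplotlib.dates starts counting):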
plot_data -= plot_data.start.min() - 1
plot_data['duration'] = plot_data.stop - plot_data.start
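It may help to keep in mind that `date2num` returns floating-point days, so `duration` here is a fraction of a day rather than a number of seconds. A small worked check, using two timestamps from the sample data and the 1900-01-01 placeholder date as an assumption:

```python
import datetime
import matplotlib.dates as mdates

# Start and stop of cha.LoginTest4 from the sample data; the date part is a placeholder.
start = mdates.date2num(datetime.datetime(1900, 1, 1, 21, 51, 38, 991000))
stop = mdates.date2num(datetime.datetime(1900, 1, 1, 21, 51, 40, 68000))

duration_days = stop - start               # matplotlib date numbers are in days
duration_seconds = duration_days * 86400   # 24 * 60 * 60 seconds per day
print(duration_days, duration_seconds)
```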
From this dataframe, plotting a horizontal bar chart over time is easy:
fig, ax = plt.subplots(figsize=(8,4))
ax.xaxis_date()
ax.barh(plot_data.index, plot_data.duration, left=plot_data.start, height=.4)
plt.tight_layout()
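If absolute clock times are not needed on the x-axis, a variation on the same idea is to skip date numbers entirely and plot seconds elapsed since the first start. This is only a sketch under that assumption; the frame below is a hypothetical four-row sample, and `spans`, `left`, and `width` are illustrative names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with the same columns as the frame above.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["21:51:38.622", "21:51:38.991", "21:51:40.068", "21:51:50.484"],
        format="%H:%M:%S.%f",
    ),
    "name": ["test", "cha.LoginTest4", "cha.LoginTest4", "test"],
})

# Start and stop per name, then offsets in seconds from the earliest start.
spans = df.groupby("name")["timestamp"].agg(start="min", stop="max")
origin = spans["start"].min()
left = (spans["start"] - origin).dt.total_seconds()
width = (spans["stop"] - spans["start"]).dt.total_seconds()

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(spans.index, width, left=left, height=0.4)
ax.set_xlabel("Seconds since first start")
fig.tight_layout()
```

Working in plain seconds avoids the epoch offset handling altogether, at the cost of losing wall-clock tick labels.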