I am trying to do the following in Python, possibly with some bash scripting as well, unless there is an easier way to do it in Python.
I have a log file with data that looks like this:
16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 137ms - OrderId: 311yrsbp - On Venue: MNOP
16:14:59.027256 - WARN - Cancel Latency: 220ms - OrderId: 311yrsbr - On Venue: QRST
16:14:59.027293 - WARN - Ack Latency: 142ms - OrderId: 311yrsbt - On Venue: UVWX
16:14:59.027329 - WARN - Cancel Latency: 134ms - OrderId: 311yrsbv - On Venue: YZ
16:14:59.027359 - WARN - Ack Latency: 75ms - OrderId: 311yrsbx - On Venue: ABCD
16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD
16:14:59.027426 - WARN - Cancel Latency: 212ms - OrderId: 311yrsc1 - On Venue: EFGH
16:14:59.027470 - WARN - Cancel Latency: 89ms - OrderId: 311yrsf7 - On Venue: IJKL
16:14:59.027495 - WARN - Cancel Latency: 97ms - OrderId: 311yrsay - On Venue: IJKL
I need to extract the last entry from each line, then take each unique entry, find every line on which it appears, and export the results to a .csv file.
I use the following bash script to get each unique entry:
cat LogFile_$(date +%Y%m%d).msg.log | awk '{print $14}' | sort | uniq
Given the data above, the bash script returns the following:
ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZ
Now I want to search (or grep) the same log file for each of those results and return the top ten lines for each. I have another bash script that does this, but how do I wrap it in a for loop? So for x, where x = each of the entries above:
grep x LogFile_$(date +%Y%m%d).msg.log | awk '{print $7}' | sort -nr | uniq | head -10
Then I want to write the results to a .csv file. The output would look something like this (each field in its own column):
Column-A Column-B Column-C Column-D
ABCD 2sxrb6ab Cancel 46ms
ABCD 2sxrb6af Cancel 45ms
ABCD 2sxrb6i2 Cancel 63ms
ABCD 2sxrb6i3 Cancel 103ms
EFGH 2sxrb6i4 Cancel 60ms
EFGH 2sxrb6i7 Cancel 60ms
IJKL 2sxrb6ie Ack 74ms
IJKL 2sxrb6if Ack 74ms
IJKL 2sxrb76s Cancel 46ms
MNOP vcxrqrs5 Cancel 7651ms
I am a beginner with Python and have not done much coding since college (13 years ago). Any help would be greatly appreciated. Thanks.
Answer 0: (score: 1)
Assuming you have already opened the file, what you want to do is record the timings for each entry, i.e. each entry maps to one or more timings:
from collections import defaultdict

entries = defaultdict(list)
for line in your_file:
    # Parse the line and return the 'ABCD' part and time
    column_a, timing = parse(line)
    entries[column_a].append(timing)
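The parse function above is left undefined in the answer; a minimal sketch of it, assuming the exact line format shown in the question, might look like this:

import re

# Hypothetical helper: pull the venue and the latency out of one log line.
LINE_RE = re.compile(r'(Ack|Cancel) Latency: (\d+ms) - OrderId: (\w+) - On Venue: (\w+)')

def parse(line):
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError('unexpected line format: %r' % line)
    msg_type, timing, order_id, venue = m.groups()
    return venue, timing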
Once that is done, you will have a dictionary like this:
{ 'ABCD': ['30ms', '25ms', '12ms'],
'EFGH': ['12ms'],
'IJKL': ['2ms', '14ms'] }
What you want to do now is turn this dictionary into another data structure (a list) sorted by the len of its values. For example:
In [15]: sorted(((k, v) for k, v in entries.items()),
key=lambda i: len(i[1]), reverse=True)
Out[15]:
[('ABCD', ['30ms', '25ms', '12ms']),
('IJKL', ['2ms', '14ms']),
('EFGH', ['12ms'])]
Of course this is only illustrative; you will probably want to collect more data in your original for loop.
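The answer stops before the .csv export the question asks for; a minimal sketch using csv.writer, reusing the entries dict built above (the file name latencies.csv is just an example), might be:

import csv

sorted_entries = sorted(entries.items(), key=lambda i: len(i[1]), reverse=True)

with open('latencies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for venue, timings in sorted_entries:
        # one row per timing, at most ten per venue
        for timing in timings[:10]:
            writer.writerow([venue, timing])

In practice you would also keep the OrderId and the Ack/Cancel type alongside each timing so that they can go into their own columns.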
Answer 1: (score: 0)
Maybe this is not as concise as you had in mind... but I think it solves your problem. I added some try...except blocks to cope better with real data.
import re
import os
import csv
import collections

# Gather all log files in the current directory. This pattern could of course
# be more sophisticated, but that is not the focus here.
log_pattern = re.compile(r"LogFile_[0-9]{8}\.msg\.log")
logfiles = [f for f in os.listdir('./') if log_pattern.match(f)]

# top n
nhead = 10

# used to parse the useful fields (both Ack and Cancel lines)
extract_pattern = re.compile(
    r'.*(Ack|Cancel) Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')

# container for the final results
res = collections.defaultdict(list)

# parse out all interesting fields
for logfile in logfiles:
    with open(logfile, 'r') as logf:
        for line in logf:
            try:  # in case of a blank line or a line without these fields
                msgtype, latency, orderid, venue = extract_pattern.match(line).groups()
            except AttributeError:
                continue
            res[venue].append((orderid, msgtype, latency))

# write to csv
with open('res.csv', 'w', newline='') as resf:
    resc = csv.writer(resf, delimiter=' ')
    for venue in sorted(res):  # sort by Venue
        entries = res[venue]
        entries.sort()  # sort by OrderId
        for i in range(0, nhead):
            try:
                resc.writerow([venue, entries[i][0], entries[i][1], entries[i][2]])
            except IndexError:  # fewer than nhead entries for this venue
                break
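For reference, with the sample log lines from the question, the ABCD rows written to res.csv would look roughly like this (sorted by OrderId, space-delimited):

ABCD 311yrsbj Cancel 100ms
ABCD 311yrsbx Ack 75ms
ABCD 311yrsbz Cancel 66ms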