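Python query: iterating through a log file

Time: 2011-06-20 23:07:35

Tags: python dictionary iteration

Can anyone help me with the following problem? I have a log file containing thousands of lines like these: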

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217
    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537
    jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018
    jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338

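I want to write a Python script that iterates over this file and, based on the jarid (the second field in the log file), collects the timestamp from every line where that jarid appears and prints them on a single line. For example, for the following two lines: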

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217 
    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537

I would get the following output:

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 00:00:02,217 ack: 00:00:04,537

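I think the best way to achieve this is to use a dictionary (or maybe not; please comment). I wrote the script below, which sort of works, but it does not give me the desired output:

    #!/opt/SP/bin/python

    log = file("/opt/SP/logs/generic.log", "r")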
    filecontent = log.xreadlines()
    storage = {}
    for line in filecontent:
        line = line.strip()
        jarid, JARID, status, STATUS, timestamp, TIME = line.split(" ")
        if JARID not in storage:
            storage[JARID] = {}
        if STATUS not in storage[JARID]:
            storage[JARID][STATUS] = {}
        if TIME not in storage[JARID][STATUS]:
            storage[JARID][STATUS][TIME] = {}

    jarids = storage.keys()
    jarids.sort()
    for JARID in jarids:
        stats = storage[JARID].keys()
        stats.sort()
        for STATUS in stats:
            times = storage[JARID][STATUS].keys()
            times.sort()
            for TIME in times:
                all = storage[JARID][STATUS][TIME].keys()
                all.sort()

    for JARID in jarids:
        if "1" in storage[JARID].keys() and "13" in storage[JARID].keys():
            print "MSG: %s, RECV: %s, ACK: %s" % (JARID, storage[JARID]["1"], storage[JARID]["13"])
        else:
            if "1" in storage[JARID].keys() and "14" in storage[JARID].keys():
                print "MSG: %s, RECV: %s, NACK: %s" % (JARID, storage[JARID]["1"], storage[JARID]["14"])

When I run this script, I get the following output:

    MSG: 7e5ae720-9151-11e0-eff2-00238bce4216, RECV: {'00:00:02,217': {}}, ACK: {'00:00:04,537': {}}

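Please note that I am still learning Python, and my scripting skills are not great.

Could someone please help me work out how to get the desired output described above?

5 answers:

Answer 0: (score: 2)

Based on JBernardo's answer, but using defaultdict instead of setdefault. You can print it in exactly the same way, so I won't copy that code here.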

from collections import defaultdict
log = ['jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
       'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338']

d = defaultdict(dict)
for i in (line.split() for line in log):
    d[i[1]][i[2]] = i[-1]

You can also unpack into meaningful names. For example:

for label1, jarid, jartype, x, label2, timestamp in (line.split() for line in log):
    d[jarid][jartype] = timestamp
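
For anyone who wants the whole thing in one place, here is a minimal sketch that combines the defaultdict grouping with the output format asked for in the question. It is written for Python 3 (the snippets above are Python 2), and the names `SAMPLE_LOG`, `group_timestamps` and `format_entry` are made up for illustration. It also assumes every valid line has exactly six whitespace-separated fields, as in the sample.

```python
from collections import defaultdict

# Sample lines copied from the question.
SAMPLE_LOG = [
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217",
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537",
    "jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018",
    "jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338",
]


def group_timestamps(lines):
    """Map each jarid to {action: timestamp}, e.g. {'recv': '00:00:02,217'}."""
    grouped = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # assumption: anything else is blank or malformed, so skip it
        _, jarid, action, _, _, timestamp = fields
        grouped[jarid][action.rstrip(":")] = timestamp
    return grouped


def format_entry(jarid, actions):
    """Render one jarid in the 'jarid: <id> recv: <ts> ack: <ts>' style."""
    parts = ["%s: %s" % (action, ts) for action, ts in actions.items()]
    return "jarid: %s %s" % (jarid, " ".join(parts))


for jarid, actions in group_timestamps(SAMPLE_LOG).items():
    print(format_entry(jarid, actions))
```

On Python 3.7 or later, dictionaries keep insertion order, so the actions come out in the order they appear in the log. To read from a real file, pass the open file object instead of `SAMPLE_LOG`.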

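Answer 1: (score: 0)

I wouldn't make status a dictionary. Instead, I would just store the timestamp under each status key in the jarid's dictionary. It is better explained with an example...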

def search_jarids(jarid):
    stored_jarid = storage[jarid]
    entry = "jarid: %s" % jarid
    for status in stored_jarid:
        entry += " %s: %s" % (status, stored_jarid[status])
    return entry

with open("yourlog.log", 'r') as log:
    lines = log.readlines()

storage = {}

for line in lines:
    line = line.strip()
    jarid_tag, jarid, status_tag, status, timestamp_tag, timestamp = line.split(" ")

    if jarid not in storage:
        storage[jarid] = {}

    status_tag = status_tag[:-1]
    storage[jarid][status_tag] = timestamp

print search_jarids("462c6d11-9151-11e0-a72c-00238bbdc9e7")

will give you:

jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 00:00:10,338 recv: 00:00:08,018

Hope that gets you started.

Answer 2: (score: 0)

Updated: this should work.

Using:
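
Note that the output above lists nack before recv, most likely because a plain dict in Python 2 does not keep any particular key order. If the order matters, one option is to sort the entries by timestamp before printing. The sketch below is Python 3. The helper name `format_chronological` is invented here, and it assumes all timestamps are zero-padded `HH:MM:SS,mmm` values from the same day, so that sorting them as text matches chronological order.

```python
def format_chronological(jarid, statuses):
    """Build one output line with the events ordered by timestamp.

    Assumption: timestamps are zero-padded HH:MM:SS,mmm strings from a
    single day, so plain string sorting is also chronological sorting.
    """
    ordered = sorted(statuses.items(), key=lambda item: item[1])
    parts = ["%s: %s" % (status, timestamp) for status, timestamp in ordered]
    return "jarid: %s %s" % (jarid, " ".join(parts))


# Hypothetical contents, shaped like the `storage` dictionary built above.
storage = {
    "462c6d11-9151-11e0-a72c-00238bbdc9e7": {
        "nack": "00:00:10,338",
        "recv": "00:00:08,018",
    },
}

jarid = "462c6d11-9151-11e0-a72c-00238bbdc9e7"
print(format_chronological(jarid, storage[jarid]))
```

If the log can span midnight, this text-based ordering no longer holds and the timestamps would need to be combined with a date first.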

log = ['jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
       'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338']

You can do this:

d = {}
for i in (line.split() for line in log):
    d.setdefault(i[1], {}).update({i[2]:i[-1]})

#as pointed by @gnibbler, you can also use "defaultdict"
#instead of dict with "setdefault"

Then you can print it:

for i,j in d.items():
    print 'jarid:', i,
    for k,m in j.items():
        print k, m,
    print

Answer 3: (score: 0)

Here is a regular expression solution:

import re
pattern = re.compile(r"""jarid:\s(\S+)       # save jarid to group 1
                         \s(recv:)\s\d+      # save 'recv:' to group 2
                         \stimestamp:\s(\S+) # save recv timestamp to group 3
                         .*?jarid:\s\1       # make sure next line has same jarid
                         \s(n?ack:)\s\d+     # save 'ack:' or 'nack:' to group 4
                         \stimestamp:\s(\S+) # save ack timestamp to group 5
                     """, re.VERBOSE | re.DOTALL | re.MULTILINE)

# `log` must be the whole file read into one string, not a list of lines
for content in pattern.finditer(log):
    print "    jarid: " + " ".join(content.groups())
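
To try the pattern without a log file, here is a self-contained Python 3 sketch that runs the same regular expression against the four sample lines from the question, held in one string. The names `PAIR_PATTERN`, `SAMPLE_LOG` and `paired_lines` are made up for this example.

```python
import re

# Same pattern as above: a recv line followed later by an ack/nack line
# carrying the same jarid (the \1 backreference).
PAIR_PATTERN = re.compile(r"""jarid:\s(\S+)       # group 1: jarid
                              \s(recv:)\s\d+      # group 2: 'recv:'
                              \stimestamp:\s(\S+) # group 3: recv timestamp
                              .*?jarid:\s\1       # later line with the same jarid
                              \s(n?ack:)\s\d+     # group 4: 'ack:' or 'nack:'
                              \stimestamp:\s(\S+) # group 5: ack/nack timestamp
                           """, re.VERBOSE | re.DOTALL | re.MULTILINE)

# The four sample lines from the question, as a single string.
SAMPLE_LOG = """\
jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217
jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537
jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018
jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338
"""


def paired_lines(text):
    """Return one formatted line per recv/ack (or recv/nack) pair found."""
    return ["jarid: " + " ".join(m.groups()) for m in PAIR_PATTERN.finditer(text)]


for line in paired_lines(SAMPLE_LOG):
    print(line)
```

One caveat: `finditer` returns non-overlapping matches, so if entries for different jarids are interleaved (recv A, recv B, ack A, ack B), the match for A swallows B's recv line and B's pair is not reported. The dictionary-based answers do not have that limitation.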

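Answer 4: (score: 0)

This solution is somewhat similar to @JBernardo's, although I chose to parse the lines with a regular expression. I've written it now, so I might as well post it; it may be useful.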

import re

line_pattern = re.compile(
    r"jarid: (?P<jarid>[a-z0-9\-]+) (?P<action>[a-z]+): (?P<status>[0-9]+) timestamp: (?P<ts>[0-9\:,]+)"
)

infile = open('/path/to/file.log')
entries = (line_pattern.match(line).groupdict() for line in infile)
events = {}

for entry in entries:
    event = events.setdefault(entry['jarid'], {})
    event[entry['action']] = entry['ts']

for jarid, event in events.iteritems():
    ack_event = 'ack' if 'ack' in event else 'nack' if 'nack' in event else None
    print 'jarid: %s recv: %s %s: %s' % (jarid, event.get('recv'), ack_event, event.get(ack_event))
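
One thing to watch for in the code above: `line_pattern.match(line)` returns `None` for any line that does not fit the pattern (a blank line, for instance), and calling `.groupdict()` on that raises an `AttributeError`. Below is a small Python 3 sketch that skips such lines and counts them instead. The function name `collect_events` and the sample input are invented for illustration.

```python
import re

# Same named-group pattern as above, split over two lines for readability.
LINE_PATTERN = re.compile(
    r"jarid: (?P<jarid>[a-z0-9\-]+) (?P<action>[a-z]+): (?P<status>[0-9]+) "
    r"timestamp: (?P<ts>[0-9:,]+)"
)


def collect_events(lines):
    """Return ({jarid: {action: timestamp}}, number_of_skipped_lines)."""
    events = {}
    skipped = 0
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            skipped += 1  # blank or unrecognised line: count it and move on
            continue
        entry = match.groupdict()
        events.setdefault(entry["jarid"], {})[entry["action"]] = entry["ts"]
    return events, skipped


# Hypothetical input: two valid lines from the question plus two bad ones.
sample = [
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217\n",
    "\n",
    "this line is not a jar event\n",
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537\n",
]

events, skipped = collect_events(sample)
print(events)
print("skipped:", skipped)
```

Whether to skip, log, or fail on bad lines depends on how much you trust the log format; counting them at least makes silent data loss visible.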