Question

Can someone help me with the following problem? I have a log file with thousands of lines that look like this:

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217
    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537
    jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018
    jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338

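I want to write a Python script that iterates over this file and, based on the jarid (the second field in each line), collects the timestamp of every line where that jarid appears and prints them on a single line. For example, for these two lines: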

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217 
    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537

I would get the following output:

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 00:00:02,217 ack: 00:00:04,537

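I think the best way to achieve this is with a dictionary (or maybe not; please comment). I wrote the script below, which sort of works but does not give me the required output: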

    #!/opt/SP/bin/python

    log = file("/opt/SP/logs/generic.log", "r")
    filecontent = log.xreadlines()
    storage = {}
    for line in filecontent:
        line = line.strip()
        jarid, JARID, status, STATUS, timestamp, TIME = line.split(" ")
        if JARID not in storage:
            storage[JARID] = {}
        if STATUS not in storage[JARID]:
            storage[JARID][STATUS] = {}
        if TIME not in storage[JARID][STATUS]:
            storage[JARID][STATUS][TIME] = {}

    jarids = storage.keys()
    jarids.sort()
    for JARID in jarids:
        stats = storage[JARID].keys()
        stats.sort()
        for STATUS in stats:
            times = storage[JARID][STATUS].keys()
            times.sort()
            for TIME in times:
                all = storage[JARID][STATUS][TIME].keys()
                all.sort()

    for JARID in jarids:
        if "1" in storage[JARID].keys() and "13" in storage[JARID].keys():
            print "MSG: %s, RECV: %s, ACK: %s" % (JARID, storage[JARID]["1"], storage[JARID]["13"])
        else:
            if "1" in storage[JARID].keys() and "14" in storage[JARID].keys():
                print "MSG: %s, RECV: %s, NACK: %s" % (JARID, storage[JARID]["1"], storage[JARID]["14"])

When I run this script, I get the following output:

    MSG: 7e5ae720-9151-11e0-eff2-00238bce4216, RECV: {'00:00:02,217': {}}, ACK: {'00:00:04,537': {}}

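Please note that I am still learning Python, and my scripting skills are not all that great!

Could someone please help me figure out how to get the required output described above?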

Answer 1

Based on JBernardo's answer, but using defaultdict instead of setdefault. You can print it in exactly the same way, so I won't repeat that code here.

from collections import defaultdict
log = ['jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
       'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338']

d = defaultdict(dict)
for i in (line.split() for line in log):
    d[i[1]][i[2]] = i[-1]

You can also unpack the fields into meaningful names. For example:

for label1, jarid, jartype, x, label2, timestamp in (line.split() for line in log):
    d[jarid][jartype] = timestamp
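
As a minimal Python 3 sketch of the approach above, the snippet below groups timestamps by jarid with a defaultdict and then builds the one-line output the question asks for. The helper names `group_by_jarid` and `format_entry` are hypothetical, and the sketch assumes every line has exactly six whitespace-separated fields.

```python
from collections import defaultdict

SAMPLE_LOG = [
    'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
    'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338',
]


def group_by_jarid(lines):
    """Map each jarid to {event label: timestamp} (hypothetical helper)."""
    grouped = defaultdict(dict)
    for line in lines:
        # Assumes exactly six fields per line; other shapes raise ValueError.
        _, jarid, label, _, _, timestamp = line.split()
        grouped[jarid][label.rstrip(':')] = timestamp
    return grouped


def format_entry(jarid, events):
    """Render one jarid and its events on a single line."""
    parts = ['%s: %s' % (label, ts) for label, ts in events.items()]
    return 'jarid: %s %s' % (jarid, ' '.join(parts))


if __name__ == '__main__':
    for jarid, events in group_by_jarid(SAMPLE_LOG).items():
        print(format_entry(jarid, events))
```

On Python 3.7 and later, dictionaries keep insertion order, so the events print in the order they appear in the log.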

Answer 2

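I would not make status a dictionary. Instead, I would just store the timestamp under each status key in the jarid's dictionary. It is better explained with an example...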

def search_jarids(jarid):
    stored_jarid = storage[jarid]
    entry = "jarid: %s" % jarid
    for status in stored_jarid:
        entry += " %s: %s" % (status, stored_jarid[status])
    return entry

with open("yourlog.log", 'r') as log:
    lines = log.readlines()

storage = {}

for line in lines:
    line = line.strip()
    jarid_tag, jarid, status_tag, status, timestamp_tag, timestamp = line.split(" ")

    if jarid not in storage:
        storage[jarid] = {}

    status_tag = status_tag[:-1]
    storage[jarid][status_tag] = timestamp

print search_jarids("462c6d11-9151-11e0-a72c-00238bbdc9e7")

will give you:

jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 00:00:10,338 recv: 00:00:08,018
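
The field order in the output above follows dictionary iteration order, which older Python versions do not tie to insertion order, so nack can appear before recv. If chronological order matters, one option is to sort the events by timestamp before joining them. This is a minimal Python 3 sketch, assuming the timestamps are zero-padded `HH:MM:SS,mmm` strings from the same day so that plain string comparison matches chronological order; `format_sorted` is a hypothetical name.

```python
def format_sorted(jarid, events):
    """Render the events for one jarid, earliest timestamp first."""
    # Zero-padded HH:MM:SS,mmm strings sort chronologically as plain text
    # (assumption: all timestamps fall on the same day).
    ordered = sorted(events.items(), key=lambda item: item[1])
    entry = 'jarid: %s' % jarid
    for status, timestamp in ordered:
        entry += ' %s: %s' % (status, timestamp)
    return entry


events = {'nack': '00:00:10,338', 'recv': '00:00:08,018'}
print(format_sorted('462c6d11-9151-11e0-a72c-00238bbdc9e7', events))
```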

Hope that gets you started.

Answer 3

Updated. This should work.

Given:

log = ['jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217', 'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537', 'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018', 'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338']

you can do:

d = {}
for i in (line.split() for line in log):
    d.setdefault(i[1], {}).update({i[2]: i[-1]})

# as pointed out by @gnibbler, you can also use "defaultdict"
# instead of dict with "setdefault"

Then you can print it out:

for i, j in d.items():
    print 'jarid:', i,
    for k, m in j.items():
        print k, m,
    print
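
The trailing-comma print statements in this answer are Python 2 syntax. A rough Python 3 equivalent of the same setdefault idea might look like the sketch below; the names `build_index` and `format_line` are hypothetical and not part of the original answer.

```python
def build_index(lines):
    """Group timestamps by jarid using dict.setdefault."""
    index = {}
    for fields in (line.split() for line in lines):
        # fields[1] is the jarid, fields[2] the label (with its trailing
        # colon), and fields[-1] the timestamp.
        index.setdefault(fields[1], {})[fields[2]] = fields[-1]
    return index


def format_line(jarid, events):
    """Join one jarid and its label/timestamp pairs into a single line."""
    return ' '.join(['jarid:', jarid] + ['%s %s' % (k, v) for k, v in events.items()])


log = [
    'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
    'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338',
]

for jarid, events in build_index(log).items():
    print(format_line(jarid, events))
```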

Answer 4

Here is a regular expression solution:

import re
pattern = re.compile(r"""jarid:\s(\S+)       # save jarid to group 1
                         \s(recv:)\s\d+      # save 'recv:' to group 2
                         \stimestamp:\s(\S+) # save recv timestamp to group 3
                         .*?jarid:\s\1       # make sure next line has same jarid
                         \s(n?ack:)\s\d+     # save 'ack:' or 'nack:' to group 4
                         \stimestamp:\s(\S+) # save ack timestamp to group 5
                     """, re.VERBOSE | re.DOTALL | re.MULTILINE)

# `log` must hold the whole file contents as a single string
for content in pattern.finditer(log):
    print "    jarid: " + " ".join(content.groups())
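
To make the regex approach easier to try, here is a self-contained Python 3 sketch that applies the same pattern to the sample lines joined into one string. The names `PAIR_PATTERN` and `find_pairs` are hypothetical. One caveat worth noting: `finditer` returns non-overlapping matches, so if the recv and ack/nack lines of different jarids are interleaved, a pair that starts inside an earlier match can be skipped.

```python
import re

PAIR_PATTERN = re.compile(r"""jarid:\s(\S+)       # group 1: jarid
                              \s(recv:)\s\d+      # group 2: 'recv:'
                              \stimestamp:\s(\S+) # group 3: recv timestamp
                              .*?jarid:\s\1       # a later line with the same jarid
                              \s(n?ack:)\s\d+     # group 4: 'ack:' or 'nack:'
                              \stimestamp:\s(\S+) # group 5: ack/nack timestamp
                           """, re.VERBOSE | re.DOTALL)


def find_pairs(text):
    """Return one formatted line per recv/(n)ack pair found in text."""
    return ['jarid: ' + ' '.join(m.groups()) for m in PAIR_PATTERN.finditer(text)]


sample = '\n'.join([
    'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
    'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338',
])

for line in find_pairs(sample):
    print(line)
```

For a real log, the text would come from reading the whole file at once, which is fine for a few thousand lines but may be worth reconsidering for very large files.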

Answer 5

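This solution is somewhat similar to @JBernardo's, although I chose to parse the lines with a regular expression. I had already written it, so I might as well post it; it may be useful.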

import re

line_pattern = re.compile(
    r"jarid: (?P<jarid>[a-z0-9\-]+) (?P<action>[a-z]+): (?P<status>[0-9]+) timestamp: (?P<ts>[0-9\:,]+)"
)

infile = open('/path/to/file.log')
entries = (line_pattern.match(line).groupdict() for line in infile)
events = {}

for entry in entries:
    event = events.setdefault(entry['jarid'], {})
    event[entry['action']] = entry['ts']

for jarid, event in events.iteritems():
    ack_event = 'ack' if 'ack' in event else 'nack' if 'nack' in event else None
    print 'jarid: %s recv: %s %s: %s' % (jarid, event.get('recv'), ack_event, event.get(ack_event))
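
One thing to watch for with per-line matching is that `match()` returns `None` for a line that does not fit the pattern, so calling `.groupdict()` on it raises an `AttributeError`. The Python 3 sketch below uses the same named-group pattern but skips such lines; `collect_events` is a hypothetical helper name, and silently skipping bad lines is an assumption about the desired behaviour.

```python
import re

LINE_PATTERN = re.compile(
    r"jarid: (?P<jarid>[a-z0-9\-]+) (?P<action>[a-z]+): (?P<status>[0-9]+) "
    r"timestamp: (?P<ts>[0-9:,]+)"
)


def collect_events(lines):
    """Map each jarid to {action: timestamp}, ignoring lines that do not match."""
    events = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # blank or malformed line: skip it (assumed behaviour)
        entry = match.groupdict()
        events.setdefault(entry['jarid'], {})[entry['action']] = entry['ts']
    return events


lines = [
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
    '',
    'not a log line',
    'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338',
]
print(collect_events(lines))
```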

Python query: iterating over a log file
