Question

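I'm new to Python and am having a hard time parsing a log file. Can you help me understand how to accomplish the following in the most Pythonic way?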

----- Log Entry 5 -----
Time       : 2016-07-12 09:00:00
Animal     : Brown Bear
Bird       : White Owl
Fish       : Salmon


----- Log Entry 6 -----
Time       : 2016-07-12 09:00:00
Animal     : Brown Bear
Bird       : Parrot
Fish       : Tuna


----- Log Entry 7 -----
Time       : 2016-07-12 09:00:00
Animal     : Lion
Bird       : White Owl
Fish       : Sword Fish


----- Log Entry 8 -----
Time       : 2016-07-12 09:15:00
Animal     : Lion
Bird       : White Owl
Fish       : Sword Fish

Desired output 1: I want to reformat the log as follows:

Time: 2016-07-12 09:00:00 Animal: Brown Bear  Bird: White Owl  Fish : Salmon
Time: 2016-07-12 09:00:00 Animal: Brown Bear  Bird: Parrot     Fish : Tuna
Time: 2016-07-12 09:00:00 Animal: Lion        Bird: White Owl  Fish : Sword Fish
Time: 2016-07-12 09:15:00 Animal: Lion        Bird: White Owl  Fish : Sword Fish

Desired output 2: I then want to be able to query a timestamp and get a summary of counts:

Time: 2016-07-12 09:00:00
Name:       Count:
Brown Bear  2
Lion        1
White Owl   2
Parrot      1
Salmon      1
Tuna        1
Sword Fish  1

Time: 2016-07-12 09:15:00
Name:       Count:
Lion        1
White Owl   1
Sword Fish  1

My code:

import os, sys, time, re, collections, subprocess

show_cmd = 'cat question |  egrep -v \'^$|=|Log\' | awk \'ORS=NR%4?FS:RS\' | grep Time'
log = (subprocess.check_output(show_cmd, shell=True).decode('utf-8'))

def time_field():
    logRegex = re.compile(r'Time\s*:.*\d\d\d-\d\d-\d\d\s\d\d:\d\d')
    log_parsed = (logRegex.findall(log))
    a = (str(log_parsed).replace('  ', ''))
    a = ((' ' + a[1:-1]).split(','))
    for i in a:
        print(i)

time_field()

Answer 1

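There are many ways to do this. Personally, I would avoid regular expressions here: they are unlikely to be more efficient, and the expressions become cumbersome and inflexible. This is what I came up with: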

class Entry:
    def __init__(self):
        self.time = None
        self.animal = None
        self.bird = None
        self.fish = None

    def __repr__(self):
        fmt = "{0} {1} {2} {3}".format(
            "Time: {time: <{width}}",
            "Animal: {animal: <{width}}",
            "Bird: {bird: <{width}}",
            "Fish: {fish: <{width}}")
        return fmt.format(
            time=self.time, animal=self.animal,
            bird=self.bird, fish=self.fish,
            width=12)

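    def __radd__(self, other):
        # Reached for `0 + entry` (the first step of sum()) and `dict + entry`
        return self.__add__(other)

    def __add__(self, other):
        if isinstance(other, dict):
            # Count this entry's names into the running totals
            for name in [self.animal, self.bird, self.fish]:
                other[name] = other.get(name, 0) + 1
            return other
        elif isinstance(other, Entry):
            # Entry + Entry: turn self into a dict, then add the other entry
            return self.__add__({}) + other
        else:
            # Anything else (such as the 0 that sum() starts from)
            return self.__add__({})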

def parse_log(path):
    def extract(line):
        start = line.find(':') + 1
        return line[start:].strip()

    entries = []
    entry = None
    with open(path, 'r') as f:
        for line in f:
            if line.startswith('-----'):
                if entry: entries.append(entry)
                entry = Entry()
            elif line.startswith('Time'):
                entry.time = extract(line)
            elif line.startswith('Animal'):
                entry.animal = extract(line)
            elif line.startswith('Bird'):
                entry.bird = extract(line)
            elif line.startswith('Fish'):
                entry.fish = extract(line)

        if entry: entries.append(entry)

    return entries


def print_output_1(entries):
    for entry in entries:
        print(entry)

def print_output_2(entries, time):
    # sum() yields 0 when nothing matches, so fall back to an empty dict
    animals = sum([e for e in entries if e.time == time]) or {}

    print("Time: {0}".format(time))
    print("Name:        Count:")
    for animal, count in animals.items():
        print("{animal: <{width}} {count}".format(
                animal=animal, count=count, width=12))


logPath = 'log.log'
time = '2016-07-12 09:15:00'
entries = parse_log(logPath)

print_output_1(entries)
print("")
print_output_2(entries, time)

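The output (assuming log.log matches the input you gave) is shown below. The order of the rows in the count summary follows dict ordering, so it may differ between Python versions: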

Time: 2016-07-12 09:00:00 Animal: Brown Bear   Bird: White Owl    Fish: Salmon
Time: 2016-07-12 09:00:00 Animal: Brown Bear   Bird: Parrot       Fish: Tuna
Time: 2016-07-12 09:00:00 Animal: Lion         Bird: White Owl    Fish: Sword Fish
Time: 2016-07-12 09:15:00 Animal: Lion         Bird: White Owl    Fish: Sword Fish

Time: 2016-07-12 09:15:00
Name:        Count:
White Owl    1
Sword Fish   1
Lion         1

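The code uses object-oriented programming to simplify the tasks we need to perform: storing log entries, representing a log entry in a specific format, and combining log entries based on specific attributes.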

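First, note that the Entry object and its attributes (self.time, self.animal, self.bird, self.fish) represent one entry in the log. Assuming the information stored in those attributes is correct, we can write a method that renders it as a formatted string. Python calls __repr__() when it wants a string representation of an object (print and str() fall back to it when no __str__() is defined), so it seemed like a good place for this code. The method makes heavy use of the format function, but how it works should be clear after a look at the Python documentation for format.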
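The least obvious part of that __repr__() is probably the nested braces. As a minimal sketch (pad_field is a hypothetical helper that is not part of the answer's code), this is how a nested replacement field such as {value: <{width}} is resolved:

```python
# Nested replacement fields: the inner {width} is substituted first, so
# "{value: <{width}}" becomes "{value: <12}", which means
# "left-align the value and pad it with spaces to 12 characters".
def pad_field(label, value, width=12):
    """Hypothetical helper: 'label: value' with the value padded to `width`."""
    return "{label}: {value: <{width}}".format(
        label=label, value=value, width=width)

line = " ".join([
    pad_field("Animal", "Lion"),
    pad_field("Bird", "White Owl"),
    pad_field("Fish", "Sword Fish"),
])
print(repr(line))  # repr() makes the trailing padding visible
```

Because every value is padded to the same width, the columns line up no matter how long each name is, as long as no name is longer than the width.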

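Next, we need a way to combine Entry objects to get the second output you specified. This can be done in many ways, and the way I did it is not necessarily the best. I used the __radd__() and __add__() methods, which are called when the + operator is used on an object. With them, entry1 + entry2 or sum([entry1, entry2]) gives the total count of the animals in both entries. However, the Entry class cannot hold the result of the sum, because it cannot contain arbitrary information. Instead, I chose a dict as the result of adding two Entry objects. To sum more than two Entry objects, an Entry must also be addable to a dict, because Entry + Entry + Entry results in dict + Entry.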

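The __add__() function checks whether the object it is being added to is a dict. If so, it checks whether each animal in the entry already exists in the dict. If not, it adds the animal as a key with a count of 1. Otherwise, it increments the value for that key. __radd__() is similar to __add__(), except that it is used in certain special cases: when the Entry is the right-hand operand and the left-hand operand does not know how to add an Entry. See the Python documentation for more information.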

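For the case where the object is an Entry, I could have written code that collects all the animals from both Entry objects and builds a dict from them. But since the code for adding an Entry to a dict already exists, it is simpler to add one object to an empty dict first and then add the resulting dict to the other Entry object.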

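For all other objects, the Entry simply returns a dict description of itself, that is, itself added to an empty dict. This is the case sum() relies on, since it starts from the integer 0.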
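To see the three cases in isolation, here is a small sketch that uses a hypothetical Tally class as a stand-in for Entry. Unlike the code above it copies the incoming dict instead of updating it in place, which is an assumption made only to keep the example free of side effects:

```python
class Tally:
    """Hypothetical stand-in for Entry: just a bag of names to count."""

    def __init__(self, *names):
        self.names = names

    def __add__(self, other):
        if isinstance(other, Tally):
            # Tally + Tally: count self into a dict, then let `other` join in
            return other + (self + {})
        # Start from a copy of the dict we were given, or from nothing
        counts = dict(other) if isinstance(other, dict) else {}
        for name in self.names:
            counts[name] = counts.get(name, 0) + 1
        return counts

    # Python tries __radd__ for `0 + tally` and `dict + tally`, because
    # neither int nor dict knows how to add a Tally on its own.
    __radd__ = __add__


# sum() evaluates ((0 + first) + second): both steps go through __radd__
total = sum([Tally("Lion", "Owl"), Tally("Lion", "Parrot")])
print(total)
```

The first step, 0 + Tally, lands in the "all other objects" branch and produces a dict. Every later step is dict + Tally, which is why the class has to accept a dict on its left-hand side.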

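Now all the tools are available to achieve the goals listed earlier. To get a string representation of an Entry that matches desired output 1, print(entry) or strrepr = str(entry) is all it takes. Desired output 2 takes a little more work, but it is just a matter of summing all entries that have the same self.time attribute and then displaying the resulting dict.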
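As mentioned, operator overloading is only one way to get the counts. A different technique, shown here as a rough sketch and not as the approach used above, is collections.Counter from the standard library. Record and count_names are hypothetical names introduced for the example:

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the Entry class, with the same four attributes
Record = namedtuple("Record", "time animal bird fish")

def count_names(records, time):
    """Count every animal, bird and fish among records with this timestamp."""
    counts = Counter()
    for r in records:
        if r.time == time:
            counts.update([r.animal, r.bird, r.fish])
    return counts

records = [
    Record("2016-07-12 09:00:00", "Brown Bear", "White Owl", "Salmon"),
    Record("2016-07-12 09:00:00", "Brown Bear", "Parrot", "Tuna"),
    Record("2016-07-12 09:00:00", "Lion", "White Owl", "Sword Fish"),
    Record("2016-07-12 09:15:00", "Lion", "White Owl", "Sword Fish"),
]

for name, count in count_names(records, "2016-07-12 09:00:00").items():
    print("{0: <12}{1}".format(name, count))
```

A Counter is a dict subclass, so the printing loop stays the same, and a timestamp with no matching entries simply gives an empty result.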

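The last part of the code not yet covered is parsing the log to create a list of Entry objects. The code simply walks through the log line by line and fills in an Entry with the information. I think it is fairly straightforward, but feel free to ask if it does not make sense.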
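If the line-by-line idea is easier to follow without the class, the same loop can be sketched with plain dicts. iter_entries is a hypothetical function, and the sketch assumes that every entry starts with a line of dashes and that each field is a "name : value" pair:

```python
def iter_entries(lines):
    """Yield one dict per log entry from any iterable of text lines."""
    entry = None
    for line in lines:
        if line.startswith("-----"):
            # A header line closes the previous entry and opens a new one
            if entry:
                yield entry
            entry = {}
        elif ":" in line and entry is not None:
            # Split on the first colon only, so the colons in the time survive
            key, _, value = line.partition(":")
            entry[key.strip()] = value.strip()
    if entry:
        yield entry  # the last entry has no header after it

sample = """\
----- Log Entry 7 -----
Time       : 2016-07-12 09:00:00
Animal     : Lion

----- Log Entry 8 -----
Time       : 2016-07-12 09:15:00
Animal     : Lion
"""
entries = list(iter_entries(sample.splitlines()))
print(entries)
```

Since it accepts any iterable of lines, the same function works on an open file object as well as on a list of strings.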
