How do I parse the following log?

Asked: 2009-06-10 18:44:15

Tags: python parsing

I need to parse a log in the following format:

===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8

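I only need to extract each item's title (the line right after ===== Item 5484/14800 =====) and its result.
So I need to keep only the line with the item title and the result for that title, and discard everything else. The issue is that sometimes an item has notes (at most 3) and sometimes the result is shown without any additional notes, which makes it tricky.
Any help would be appreciated. I'm writing the parser in Python, but I don't need actual code, just some pointers on how to go about this.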

Later edit: the result I'm looking for is to discard everything else and get something like:

('This is the item title','Foo')
then
('This is this items title','Bar')

9 answers:

Answer 0 (score: 5)

1) Loop through every line in the log

    a)If line matches appropriate Regex:

      Display/Store Next Line as the item title.
      Look for the next line containing "Result 
      XXXX." and parse out that result for 
      including in the result set.

Edit: added a bit more now that I see the result you're looking for.
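
The steps above could be sketched in Python roughly as follows. This is only a sketch: the two patterns are assumptions inferred from the sample log in the question, and `HEADER_RE`, `RESULT_RE`, and `parse_log` are illustrative names, not something the answer specifies.

```python
import re

# Assumed patterns, inferred from the sample log in the question.
HEADER_RE = re.compile(r"^===== Item \d+/\d+.*=====$")
RESULT_RE = re.compile(r"Result ([^.]+)\.")


def parse_log(lines):
    """Return a list of (title, result) tuples, one per finished item."""
    results = []
    title = None
    expect_title = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if HEADER_RE.match(line):
            expect_title = True          # the next line holds the title
        elif expect_title:
            title = line
            expect_title = False
        else:
            m = RESULT_RE.search(line)
            if m and title is not None:
                results.append((title, m.group(1)))
    return results


sample = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7"""

for pair in parse_log(sample.splitlines()):
    print(pair)
```

Because the title is simply overwritten each time a header is seen, repeated "Update" sections for the same item need no special handling; only the section that contains a result line produces a tuple.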

Answer 1 (score: 5)

I know you didn't ask for real code, but this is too great an opportunity for a generator-based text muncher to pass up:

# data is a multiline string containing your log, but this
# function could be easily rewritten to accept a file handle.
def get_stats(data):

   title = ""
   grab_title = False

   for line in data.split('\n'):
      if line.startswith("====="):
         grab_title = True
      elif grab_title:
         grab_title = False
         title = line
      elif line.startswith("Test finished."):
         start = line.index("Result") + 7
         end   = line.index("Time")   - 2
         yield (title, line[start:end])


for d in get_stats(data):
   print(d)


# Prints:
# ('This is the item title', 'Foo')
# ('This is this items title', 'Bar')
# ('This is the title of this item', 'FooBar')

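Hopefully this is straightforward enough. Do ask if you have questions about exactly how the above works.

Answer 2 (score: 1)

Maybe something like this (log.log is your file):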

def doOutput(s):  # process or store data
    print(s)

s = ''
with open('log.log') as log:
    for line in log:
        if line.startswith('====='):
            # A header line closes the previous block.
            if s:
                doOutput(s)
                s = ''
        else:
            s += line
if s:
    doOutput(s)  # flush the last block
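
The loop above hands each block (everything between two header lines) to the output function as one string. As a sketch of what that function might then do to get the title and result, here is a hypothetical helper `extract`; its name and the "Test finished." / "Result " markers are assumptions based on the sample log.

```python
def extract(block):
    """Return (title, result) for one block, or None if it has no result line."""
    lines = block.splitlines()
    if not lines:
        return None
    title = lines[0]  # the first line after the header is the title
    for line in lines[1:]:
        if line.startswith("Test finished.") and "Result " in line:
            # "Test finished. Result Bar. Time 4 secunds." -> "Bar"
            result = line.split("Result ", 1)[1].split(".", 1)[0]
            return (title, result)
    return None


finished = ("This is this items title\n"
            "Info: some note\n"
            "Test finished. Result Bar. Time 4 secunds.\n"
            "Stats: CPU 0.9 MEM 4.7\n")
unfinished = "This is the item title\nInfo: some note\n"

print(extract(finished))
print(extract(unfinished))
```

Blocks that only carry a note (the intermediate "Update" sections) yield None and can simply be skipped.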

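Answer 3 (score: 1)

I'd suggest starting a loop that looks for "===" in the line. Let that key you to the title, which is the next line. Set a flag that looks for the result, and if you don't find a result before you hit the next "===", say there is no result. Otherwise, log the result with the title. Reset your flag and repeat. You could also store the results in a dictionary keyed by title, storing "No Results" only when no result is found between the title and the next "===" line.

Based on the output, this looks pretty simple to do.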
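
A minimal sketch of that flag-and-dictionary idea, assuming result lines start with "Test finished." as in the question's sample. The function name `collect_results` and the sample text are made up for illustration.

```python
def collect_results(lines):
    """Map each title to its result, or to "No Results" if none was seen."""
    results = {}
    title = None
    next_is_title = False
    found_result = False

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("==="):
            # A new "===" line closes the previous section.
            if title is not None and not found_result:
                results.setdefault(title, "No Results")
            next_is_title = True
            found_result = False
        elif next_is_title:
            title = line
            next_is_title = False
        elif (title is not None and line.startswith("Test finished.")
              and "Result " in line):
            # "Test finished. Result Foo. Time 12 secunds." -> "Foo"
            results[title] = line.split("Result ", 1)[1].split(".", 1)[0]
            found_result = True

    # The last section has no following "===" line to close it.
    if title is not None and not found_result:
        results.setdefault(title, "No Results")
    return results


# Made-up sample: the second item never reports a result.
sample = """===== Item 1/3  =====
First title
Info: some note
===== Item 1/3 (Update 1/1) =====
First title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 2/3  =====
Second title
Info: some note
"""

print(collect_results(sample.splitlines()))
```

Using `setdefault` for the "No Results" case means a real result recorded for the same title is never overwritten by a later section without one. Note that keying by title merges items that happen to share a title; keying by item number would avoid that.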

Answer 4 (score: 1)

A regular expression with group matching seems to do the job in Python:

import re

data = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8"""


p = re.compile(r"^=====[^=]*=====\n(.*)$\nInfo: .*\n.*Result ([^.]*)\.",
               re.MULTILINE)
for m in p.finditer(data):
    print("title:", m.group(1), "result:", m.group(2))

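If you need more information about regular expressions, check the Python docs.

Answer 5 (score: 1)

This is a continuation of maciejka's solution (see the comments there). If the data is in the file daniels.log, we can go through it item by item with itertools.groupby and apply a multi-line regexp to each item. This should scale well.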

import itertools, re

p = re.compile(r"Result ([^.]*)\.", re.MULTILINE)
with open('daniels.log') as log:
    # groupby alternates between header lines (sep is True)
    # and the body lines that follow them (sep is False).
    for sep, item in itertools.groupby(
            log, lambda x: x.startswith('===== Item ')):
        if not sep:
            title = next(item).strip()
            m = p.search(''.join(item))
            if m:
                print((title, m.group(1)))

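Answer 6 (score: 0)

You can parse this with regular expressions. If the text is reasonably well structured (as yours appears to be), you can also use faster tests (e.g. line.startswith() and the like). A list of dictionaries seems a suitable data type for this kind of key-value pairs. Not sure what else to tell you. This looks pretty trivial.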


OK, so in this case the regexp way turns out to be the better fit:

import re
re.findall("=\n(.*)\n", s)

is faster than the list comprehension

[item.split('\n', 1)[0] for item in s.split('=\n')]

Here is what I got:

>>> len(s)
337000000
>>> test(get1, s) #list comprehensions
0:00:04.923529
>>> test(get2, s) #re.findall()
0:00:02.737103

Lesson learned.
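
The `test` helper used above is not shown. As a rough sketch of how such a comparison might be reproduced, the following uses the standard `timeit` module instead; the bodies of `get1` and `get2` are taken from the two snippets above, while the input string is a small made-up stand-in for the 337 MB one, so the timings will not match.

```python
import re
import timeit


def get1(s):
    # List comprehension: split on "=\n" and keep each chunk's first line.
    return [item.split("\n", 1)[0] for item in s.split("=\n")]


def get2(s):
    # re.findall: capture the line that follows every "=\n".
    return re.findall("=\n(.*)\n", s)


# Made-up input: one short section repeated many times.
block = "===== Item 1/2  =====\nSome title\nInfo: some note\n"
s = block * 1000

# The two are not quite equivalent: get1 also returns the text before the
# first "=\n" (the first header line minus its final "=").
assert get1(s)[1:] == get2(s)

for name, fn in (("list comprehension", get1), ("re.findall", get2)):
    seconds = timeit.timeit(lambda: fn(s), number=20)
    print("%s: %.4f s" % (name, seconds))
```

Which variant wins may depend on the input size and the Python version, so it is worth measuring on real data.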

Answer 7 (score: 0)

You could try something like this (in C-like pseudocode, since I don't know Python):

string line=getline();
regex boundary="^===== [^=]+ =====$";
regex info="^Info: (.*)$";
regex test_data="Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$";
regex stats="Stats: (.*)$";
while(!eof())
{
  // sanity check
  test line against boundary, if they don't match, throw exception

  string title=getline();

  while(1)
  {  
    // end the loop if we finished the data
    if(eof()) break;

    line=getline();
    test line against boundary, if they match, break
    test line against info, if they match, load the first matched group into "info"
    test line against test_data, if they match, load the first matched group into "test_result", load the 2nd matched group into "result", load the 3rd matched group into "time"
    test line against stats, if they match, load the first matched group into "statistics"
  }

  // at this point you can use the variables set above to do whatever with a line
  // for example, you want to use title and, if set, test_result/result/time.

}
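
For readers who want to try it, here is one way the pseudocode above might translate to Python. It is a sketch under assumptions: the boundary pattern uses five "=" characters to match the sample log, and `parse_records` and the dictionary keys are illustrative names.

```python
import re

BOUNDARY = re.compile(r"^===== [^=]+ =====$")
INFO = re.compile(r"^Info: (.*)$")
TEST_DATA = re.compile(r"Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$")
STATS = re.compile(r"Stats: (.*)$")


def parse_records(lines):
    """Yield one dict per section; every key except "title" is optional."""
    lines = [line.rstrip("\r\n") for line in lines]
    i = 0
    while i < len(lines):
        # Sanity check: every section must start with a boundary line.
        if not BOUNDARY.match(lines[i]):
            raise ValueError("expected a boundary at line %d" % (i + 1))
        if i + 1 >= len(lines):
            raise ValueError("boundary at line %d has no title" % (i + 1))
        record = {"title": lines[i + 1]}
        i += 2
        # Consume lines until the next boundary or the end of the data.
        while i < len(lines) and not BOUNDARY.match(lines[i]):
            line = lines[i]
            m = INFO.match(line)
            if m:
                record["info"] = m.group(1)
            m = TEST_DATA.search(line)
            if m:
                record["test_result"], record["result"], record["time"] = m.groups()
            m = STATS.search(line)
            if m:
                record["statistics"] = m.group(1)
            i += 1
        yield record


sample = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7"""

for rec in parse_records(sample.splitlines()):
    if "result" in rec:
        print((rec["title"], rec["result"]))
```

Sections without a result line still produce a record, so the caller decides whether to skip them, as the final loop does here.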

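Answer 8 (score: -1)

Here is some not-so-pretty Perl code that does the job. Maybe you'll find it useful in some way. It's a quick hack, and there are other ways of doing it (I feel this code needs defending).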

#!/usr/bin/perl -w
#
# $Id$
#

use strict;
use warnings;

my @ITEMS;
my $item;
my $state = 0;

open(FD, "< data.txt") or die "Failed to open file.";
while (my $line = <FD>) {
    $line =~ s/(\r|\n)//g;
    if ($line =~ /^===== Item (\d+)\/\d+/) {
        my $item_number = $1;
        if ($item) {
            # Just to make sure we don't have two lines that seems to be a headline in a row.
            # If we have an item but haven't set the title it means that there are two in a row that matches.
            die "Something seems to be wrong, better safe than sorry. Line $. : $line\n" if (not $item->{title});
            # If we have a new item number, add the previous item and create a new one.
            if ($item_number != $item->{item_number}) {
                push(@ITEMS, $item);
                $item = {};
                $item->{item_number} = $item_number;
            }
        } else {
            # First entry, don't have an item.
            $item = {}; # Create new item.
            $item->{item_number} = $item_number;
        }
        $state = 1;
    } elsif ($state == 1) {
        die "Data must start with a headline." if (not $item);
        # If we already have a title make sure it matches.
        if ($item->{title}) {
            if ($item->{title} ne $line) {
                die "Title doesn't match for item " . $item->{item_number} . ", line $. : $line\n";
            }
        } else {
            $item->{title} = $line;
        }
        $state++;
    } elsif (($state == 2) && ($line =~ /^Info:/)) {
        # Just make sure that for state 2 we have a line that match Info.
        $state++;
    } elsif (($state == 3) && ($line =~ /^Test finished\. Result ([^.]+)\. Time \d+ secunds{0,1}\.$/)) {
        $item->{status} = $1;
        $state++;
    } elsif (($state == 4) && ($line =~ /^Stats:/)) {
        $state++; # After Stats we must have a new item or we should fail.
    } else {
        die "Invalid data, line $.: $line\n";
    }
}
# Need to take care of the last item too.
push(@ITEMS, $item) if ($item);
close FD;

# Loop our items and print the info we stored.
for $item (@ITEMS) {
    print $item->{item_number} . " (" . $item->{status} . ") " . $item->{title} . "\n";
}