Question

I have a Java application that, whenever an error occurs, writes an error stack like the one below for each error.

<Errors>
    <Error ErrorCode="Code" ErrorDescription="Description" ErrorInfo="" ErrorId="ID">
        <Attribute Name="ErrorCode" Value="Code"/>
        <Attribute Name="ErrorDescription" Value="Description"/>
        <Attribute Name="Key" Value="Key"/>
        <Attribute Name="Number" Value="Number"/>
        <Attribute Name="ErrorId" Value="ID"/>
        <Attribute Name="UserId" Value="User"/>
        <Attribute Name="ProgId" Value="Prog"/>
        <Stack>typical Java stack</Stack>
    </Error>
    <Error>
      Similar info to the above
    </Error>
</Errors>

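I wrote a Java log parser that walks the log files and collects information about these errors. It works, but it is slow and inefficient, especially on log files of several hundred megabytes. I am basically just using string operations to find where the start/end tags are and count them.

Is there a way (via Unix grep, Python, or Java) to efficiently extract the errors and count how many times each one occurs? The log file as a whole is not XML, so I cannot use an XML parser or XPath. Another problem I face is that the end of an error sometimes rolls over into another file, so the current file may not contain the whole stack shown above.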

Edit 1:

Here is what I have so far (only the relevant parts, to save space).

//Parse files
for (File f : allFiles) {
   System.out.println("Parsing: " + f.getAbsolutePath());
   BufferedReader br = new BufferedReader(new FileReader(f));
   String line = "";
   String fullErrorStack = "";
   while ((line = br.readLine()) != null) {     
      if (line.contains("<Errors>")) {
         fullErrorStack = line;
         while (!line.contains("</Errors>")) {
            line = br.readLine();
            try {
               fullErrorStack = fullErrorStack + line.trim() + " ";
            } catch (NullPointerException e) {
               //End of file but end of error stack is in another file.
               fullErrorStack = fullErrorStack + "</Stack></Error></Errors> ";
               break;
            }
         }
         String errorCode = fullErrorStack.substring(fullErrorStack.indexOf("ErrorCode=\"") + "ErrorCode=\"".length(), fullErrorStack.indexOf("\" ", fullErrorStack.indexOf("ErrorCode=\"")));
         String errorDescription = fullErrorStack.substring(fullErrorStack.indexOf("ErrorDescription=\"") + "ErrorDescription=\"".length(), fullErrorStack.indexOf("\" ", fullErrorStack.indexOf("ErrorDescription=\"")));
         String errorStack = fullErrorStack.substring(fullErrorStack.indexOf("<Stack>") + "<Stack>".length(), fullErrorStack.indexOf("</Stack>", fullErrorStack.indexOf("<Stack>")));
         apiErrors.add(f.getAbsolutePath() + splitter + errorCode + ": " + errorDescription + splitter + errorStack.trim());
         fullErrorStack = "";
      }
   }
}


Set<String> uniqueApiErrors = new HashSet<String>(apiErrors);
for (String uniqueApiError : uniqueApiErrors) {
    apiErrorsUnique.add(uniqueApiError + splitter + Collections.frequency(apiErrors, uniqueApiError));
}
Collections.sort(apiErrorsUnique);

Edit 2:

Sorry, I forgot to mention the desired output. Something like the following would be ideal.

Count, ErrorCode, ErrorDescription, list of files it occurs in (if possible)

Answer 1

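Well, it's technically not grep, but if you are open to using other standard UNIX-esque commands, the following does the job, and it should be fast (I'd actually be interested to see the results on your data set):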

sed -r -e '/Errors/,/<\/Errors>/!d' *.log -ne 's/.*<Error\s+ErrorCode="([^"]*)"\s+ErrorDescription="([^"]*)".*$/\1: \2/p' | sort | uniq -c | sort -nr

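Assuming the files sort in date order, the *.log glob also takes care of the log rollover problem (adjust it to match your log naming, of course).

Sample output

Based on your (dubious) test data: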

 10 SomeOtherCode: This extended description
  4 Code: Description
  3 ReallyBadCode: Disaster Description

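Brief explanation

Use sed to print only the lines between the selected addresses (here, the <Errors> ... </Errors> lines)
Use sed again to filter those with a regular expression, replacing each header line with a combined, sufficiently unique error string (including the description), similar to what your Java does (or at least the part of it we can see)
Sort and count these unique strings
Present them in descending order of frequency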
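
Since the question also mentions Python, the four steps above could be sketched roughly as follows. This is only an illustration, not a drop-in replacement for the sed pipeline: the names read_all, count_errors, and report are made up for this sketch, it looks for the literal <Errors> and </Errors> tags rather than sed's looser /Errors/ address, and the order of entries with equal counts may differ from sort -nr.

```python
import re
from collections import Counter

# Same header pattern as the sed expression: capture ErrorCode and ErrorDescription.
HEADER = re.compile(r'<Error\s+ErrorCode="([^"]*)"\s+ErrorDescription="([^"]*)"')


def read_all(paths):
    """Yield lines from all files in order, as one continuous stream.

    Treating the files as a single stream is what lets a block that
    rolls over into the next file be handled like any other block.
    """
    for path in paths:
        with open(path, errors="replace") as fh:
            yield from fh


def count_errors(lines):
    """Count "code: description" strings found inside <Errors>...</Errors>."""
    counts = Counter()
    inside = False  # plays the role of the sed address range
    for line in lines:
        if not inside and "<Errors>" in line:
            inside = True
        if inside:
            m = HEADER.search(line)
            if m:
                counts[f"{m.group(1)}: {m.group(2)}"] += 1
            if "</Errors>" in line:
                inside = False
    return counts


def report(counts):
    """Return "count key" strings, highest count first."""
    return [f"{n} {key}" for key, n in counts.most_common()]
```

With a list of log paths sorted in date order, something like report(count_errors(read_all(paths))) would give the lines to print.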

Answer 2

Given your updated question:

$ cat tst.awk
BEGIN{ OFS="," }
match($0,/\s*<Error ErrorCode="([^"]+)" ErrorDescription="([^"]+)".*/,a) {
    code = a[1]
    desc[code] = a[2]
    count[code]++
    files[code][FILENAME]
}
END {
    print "Count", "ErrorCode", "ErrorDescription", "List of files it occurs in"
    for (code in desc) {
        fnames = ""
        for (fname in files[code]) {
            fnames = (fnames ? fnames " " : "") fname
        }
        print count[code], code, desc[code], fnames
    }
}
$
$ awk -f tst.awk file
Count,ErrorCode,ErrorDescription,List of files it occurs in
1,Code,Description,file

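It still requires gawk 4.* for the 3rd arg to match() and for 2D arrays, but again, both are easily worked around in any awk.

Per the request in the comments here, this is a non-gawk version: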

$ cat tst.awk
BEGIN{ OFS="," }
/[[:space:]]*<Error / {
    split("",n2v)
    while ( match($0,/[^[:space:]]+="[^"]+/) ) {
        name = value = substr($0,RSTART,RLENGTH)
        sub(/=.*/,"",name)
        sub(/^[^=]+="/,"",value)
        $0 = substr($0,RSTART+RLENGTH)
        n2v[name] = value
    }
    code = n2v["ErrorCode"]
    desc[code] = n2v["ErrorDescription"]
    count[code]++
    if (!seen[code,FILENAME]++) {
        fnames[code] = (code in fnames ? fnames[code] " " : "") FILENAME
    }
}
END {
    print "Count", "ErrorCode", "ErrorDescription", "List of files it occurs in"
    for (code in desc) {
        print count[code], code, desc[code], fnames[code]
    }
}
$
$ awk -f tst.awk file
Count,ErrorCode,ErrorDescription,List of files it occurs in
1,Code,Description,file

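There are various ways to do the above, some more concise, but when the input contains name=value pairs I like to create a name2value array (n2v[] is the name I usually give it) so I can access the values by name. That makes the code easy to understand and easy to modify later to add fields, etc.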
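
For readers who prefer Python, the same name-to-value idea might look like the sketch below. The function name name_to_value and the PAIR pattern are assumptions made for illustration. Unlike the awk loop, this version also keeps attributes whose value is empty, such as ErrorInfo="".

```python
import re

# One name="value" pair; the value is allowed to be empty.
PAIR = re.compile(r'([^\s=<>]+)="([^"]*)"')


def name_to_value(line):
    """Return a dict mapping attribute name -> value for one tag line."""
    return dict(PAIR.findall(line))


line = '<Error ErrorCode="Code" ErrorDescription="Description" ErrorInfo="" ErrorId="ID">'
n2v = name_to_value(line)
# Values can now be looked up by name instead of by position.
print(n2v["ErrorCode"], n2v["ErrorDescription"])
```

Adding another field to the report then only means reading one more key from the dict.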

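Here is my previous answer, since it contains some things you may find useful in other situations:

You didn't say what you want the output to look like, and the sample input you posted isn't adequate to test against and show useful output, but this GNU awk script shows a way to get a count of whatever attribute name/value pairs you like: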

$ cat tst.awk
match($0,/\s*<Attribute Name="([^"]+)" Value="([^"]+)".*/,a) { count[a[1]][a[2]]++ }
END {
    print "\nIf you just want to see the count of all error codes:"
    name = "ErrorCode"
    for (value in count[name]) {
        print name, value, count[name][value]
    }

    print "\nOr if theres a few specific attributes you care about:"
    split("ErrorId ErrorCode",names,/ /)
    for (i=1; i in names; i++) {
        name = names[i]
        for (value in count[name]) {
            print name, value, count[name][value]
        }
    }

    print "\nOr if you want to see the count of all values for all attributes:"
    for (name in count) {
        for (value in count[name]) {
            print name, value, count[name][value]
        }
    }
}

$ gawk -f tst.awk file

If you just want to see the count of all error codes:
ErrorCode Code 1

Or if theres a few specific attributes you care about:
ErrorId ID 1
ErrorCode Code 1

Or if you want to see the count of all values for all attributes:
ErrorId ID 1
ErrorDescription Description 1
ErrorCode Code 1
Number Number 1
ProgId Prog 1
UserId User 1
Key Key 1

If your data is spread across multiple files, that makes no difference to the above; just list all the files on the command line:

gawk -f tst.awk file1 file2 file3 ...

It uses GNU awk 4.* for true multi-dimensional arrays, but there are simple workarounds for any other awk if necessary.
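
As a point of comparison, the count[name][value] structure used by the attribute-counting script can be approximated in Python with a dictionary of counters. This is a minimal sketch; count_attributes and ATTRIBUTE are illustrative names, not part of the awk answer.

```python
import re
from collections import Counter, defaultdict

# Matches the <Attribute Name="..." Value="..."/> lines inside each error.
ATTRIBUTE = re.compile(r'<Attribute Name="([^"]+)" Value="([^"]+)"')


def count_attributes(lines):
    """Return counts such that counts[name][value] is the number of occurrences."""
    counts = defaultdict(Counter)
    for line in lines:
        m = ATTRIBUTE.search(line)
        if m:
            counts[m.group(1)][m.group(2)] += 1
    return counts


sample = [
    '<Attribute Name="ErrorCode" Value="Code"/>',
    '<Attribute Name="UserId" Value="User"/>',
    '<Attribute Name="ErrorCode" Value="Code"/>',
]
counts = count_attributes(sample)
# Print every value seen for one attribute name, with its count.
for value, n in counts["ErrorCode"].items():
    print("ErrorCode", value, n)
```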

One way to run the awk command recursively over the files under a directory:

awk -f tst.awk $(find dir -type f -print)

Answer 3

I figure that since you mentioned Unix grep, you probably have perl too. Here is a simple perl solution:

#!/usr/bin/perl

my %countForErrorCode;
while (<>) { /<Error ErrorCode="([^"]*)"/ && $countForErrorCode{$1}++ }
foreach my $e (keys %countForErrorCode) { print "$countForErrorCode{$e} $e\n" }

Assuming you are running *nix, save this perl script, make it executable, and run it with a command like...

$ ./grepError.pl *.log

and you should get output like...

8 Code1
203 Code2
...

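where 'Code1' etc. are the error codes captured between the double quotes in the regular expression.

I got this working on Windows using Cygwin. This solution assumes:

Your perl is located at /usr/bin/perl. You can verify this with $ which perl
The regular expression above, /<Error ErrorCode="([^"]*)"/, identifies what you want to count.

What the code is doing:

my %countForErrorCode declares a map (hash).
while (<>) iterates over each line of input and assigns the current line to the built-in variable $_.
/<Error ErrorCode="([^"]*)"/ implicitly attempts a match against $_.
When a match occurs, the parentheses capture the value between the double quotes and assign the captured string to $1.
The regular expression match "returns true" when it matches, and only then is the count incremented by && $countForErrorCode{$1}++.
For output, foreach my $e (keys %countForErrorCode) iterates over the captured error codes, and print "$countForErrorCode{$e} $e\n" prints the count and the code on one line.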

Edit: more detailed output, per the updated spec

#!/usr/bin/perl

my %dataForError;

while (<>) {
  if (/<Error ErrorCode="([^"]+)"\s*ErrorDescription="([^"]+)"/) {
    if (! $dataForError{$1}) {
      $dataForError{$1} = {}; 
      $dataForError{$1}{'desc'} = $2;
      $dataForError{$1}{'files'} = {};
    }
    $dataForError{$1}{'count'}++;
    $dataForError{$1}{'files'}{$ARGV}++;
  }
}
my @out;
foreach my $e (keys %dataForError) {
  my $files = join("\n\t", keys %{ $dataForError{$e}{'files'} });
  my $out = "$dataForError{$e}{'count'}, $e, '$dataForError{$e}{'desc'}'\n\t$files\n";
  push @out, $out;
}
print @out;

As in the answer posted above, to pick up the input files recursively you can run this script like so:

$ find . -name "*.log" | xargs ./grepError.pl

and it produces output like:

8, Code2, 'bang'
    ./today.log
48, Code4, 'oops'
    ./2015/jan/yesterday.log
2, Code1, 'foobar'
    ./2014/dec/someday.log

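Explanation:

The script maps each unique error code to a hash that tracks the count, the description, and the unique file names in which that error code was found.
Perl automatically stores the name of the current input file in $ARGV.
The script counts occurrences per unique file name but does not output those counts.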
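
The data structure described above translates fairly directly to Python, which the question also mentions. The sketch below assumes the caller supplies (filename, line) pairs; summarize and to_rows are hypothetical names, and the row layout follows the Count, ErrorCode, ErrorDescription, file list order requested in the question rather than the perl script's output format.

```python
import re

# Same header pattern as the perl script above.
HEADER = re.compile(r'<Error ErrorCode="([^"]+)"\s*ErrorDescription="([^"]+)"')


def summarize(named_lines):
    """Build code -> {"desc", "count", "files"} from (filename, line) pairs."""
    data = {}
    for filename, line in named_lines:
        m = HEADER.search(line)
        if not m:
            continue
        code, desc = m.groups()
        # The first sighting of a code creates its record and fixes its description.
        rec = data.setdefault(code, {"desc": desc, "count": 0, "files": {}})
        rec["count"] += 1
        # Per-file counts are tracked too, as in the perl version.
        rec["files"][filename] = rec["files"].get(filename, 0) + 1
    return data


def to_rows(data):
    """Return (count, code, description, space-separated file list) tuples."""
    return [
        (rec["count"], code, rec["desc"], " ".join(rec["files"]))
        for code, rec in data.items()
    ]
```

The rows could then be written out with the csv module or a plain print, depending on how the report is consumed.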

Extracting Java error stacks from log files
