Question

I have the following 10 GB log file that I need to analyze directly on a Unix server.

2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message1
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message2
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message3
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message4
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message5
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG some message6
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id> 
<!-- id is not unique since the XML data provides all the
information of an object X defined by its id at a specific point in time -->
some XML content on more than 500 lines
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:30,333 [ABC] [DEF] DEBUG some message9
2017-12-12 13:04:30,334 [ABC] [DEF] INFO some message10
2017-12-12 13:04:30,334 [ABC] [DEF] INFO some message11
2017-12-12 13:04:31,431 [ABC] [DEF] INFO some message12
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>2</id>
some XML content on more than 500 lines 
</xml>
2017-12-12 13:04:31,432 [ABC] [DEF] DEBUG some message13
2017-12-12 13:04:31,476 [ABC] [DEF] INFO some message14
2017-12-12 13:04:31,476 [ABC] [DEF] DEBUG some message14
2017-12-12 13:04:31,490 [ABC] [DEF] DEBUG some message15
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
some XML content on more than 500 lines 
</xml>
2017-12-12 13:04:31,491 [ABC] [DEF] DEBUG some message16
2017-12-12 13:04:31,491 [ABC] [DEF] DEBUG some message17
2017-12-12 13:04:31,496 [ABC] [DEF] DEBUG some message18
2017-12-12 13:04:31,996 [ABC] [DEF] INFO some message19

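To do this, I want to extract each XML message and dump it into a separate file.

For example, the first XML message would be stored in file1.xml, the second in file2.xml, and so on.

If all the matches had to be extracted into a single file, it would be fairly straightforward: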

sed -n 's~<xml>(\s*\.*\s*)\s*</xml>~p' file.in > file.out #just a prototype
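For the single-file case, a minimal sketch that does run is shown here. It uses a range address instead of a substitution, because sed reads one line at a time and a single s command cannot match from <xml> to </xml> across lines. The names sample.log and all_xml.out are placeholders, and the tiny log is only a stand-in for the real file.

```shell
# Build a tiny stand-in for the real log (sample.log is a hypothetical name).
cat > sample.log <<'EOF'
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message1
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
some XML content
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>2</id>
more XML content
</xml>
EOF

# Print only the lines from each <xml> line through the next </xml> line.
# The s command drops the log prefix in front of the opening tag.
sed -n '/<xml>/,/<\/xml>/{s/^.*<xml>/<xml>/;p;}' sample.log > all_xml.out
```

All blocks end up concatenated in all_xml.out, which is exactly why this is not enough when each block needs its own file.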

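I thought of a solution that uses a back-reference to the value of the <id> tag inside the XML to name the file I dump to, but it does not work: the same <id> value appears at several places in the log file, so a later block would overwrite an earlier extraction.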

sed -r 's~(<xml>…<id>(.*)</id>…</xml>)~echo "\1" >> \2.out~e' file.in #just a prototype

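With awk, this would also be very simple if the XML content were on a single line. That is not the case, however, and I don't know which record separator I should define for RS so that the XML content is processed as if it were on one line and then dumped to a separate file.

With awk, what I think would work is:

First, identify the <xml> start tag in the log and set a test variable to yes.
Store each XML line in a buffer variable, then dump the buffer to file$i.out when I reach </xml> (resetting the test variable to no, of course).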
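The two steps above can be sketched roughly as follows. This is only an illustration of the flag-and-buffer idea under stated assumptions: the opening tag sits at the end of a log line, the closing tag is on its own line, and the names sample.log, inxml, buf, and out are made up for the example.

```shell
# sample.log is a hypothetical stand-in for the real 10 GB log.
cat > sample.log <<'EOF'
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message6
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
first XML block
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>2</id>
second XML block
</xml>
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
third XML block
</xml>
EOF

awk '
    # Opening tag found: raise the flag, pick the next file name, and
    # start the buffer with a bare <xml> line (log prefix dropped).
    /<xml>/ { inxml = 1; out = "file" (++i) ".out"; buf = "<xml>"; next }

    # Closing tag found: write the buffered block, then reset the flag.
    inxml && /<\/xml>/ {
        print buf "\n" $0 > out
        close(out)          # avoid running out of file descriptors
        inxml = 0
        next
    }

    # Any other line inside a block is appended to the buffer.
    inxml { buf = buf "\n" $0 }
' sample.log
```

Because the file name comes from a counter rather than from the <id> value, the third block (which repeats id 1) lands in file3.out instead of overwriting file1.out.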

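It would be great if you have a better awk solution, or a sed solution in which I can access a variable holding the number of the match currently being processed and reuse it to build the output file name (something like: current_pattern_position used to generate file_$current_pattern_position.out).

I received very interesting solutions using awk and perl. I would still like to have a working sed solution for this case.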

Answer 1

Update: here is a portable, simplified approach using Sed:

#!/bin/sed -nf

# Execute the following group of commands for each line in the XML node to
# generate a series of shell commands that we'll feed into an interpreter:
/<xml>/,/<\/xml>/ {
    # Extract the ID number to generate a command that changes the output file:
    /^<id>\([0-9]\+\)<\/id>$/ {
        # Using the same pattern as above, substitute the ID number into a
        # command that updates the current output file and increments a counter
        # for the ID that we'll append as the filename extension:
        s//c\1=$(( c\1 + 1 )); exec > "file\1.$c\1"/
        # Output the generated command:
        p
        # Then, proceed to the next line:
        n
    }
    # Output any remaining lines in the XML block except for the <xml> tags:
    /<xml>\|<\/xml>/ !{
        # Escape any single quotes in the XML content (so we can wrap it in a
        # shell command below):
        s/'/'"'"'/g
        #'# (...ignore or remove this line...)
        # Generate a command that will write the line to the current file:
        s/^.*$/echo '&'/
        # Output the generated command:
        p
    }
}

As we can see, the Sed program generates a series of shell commands from the input, which we can pipe to a shell interpreter to write the output files:

$ sed -nf parse_log.sed < file.in | sh

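This avoids excessive hold-space buffering and GNU Sed's e flag, which is very slow (we would have to spawn a subshell every time we need to write a file), and it lets us efficiently track how many times we have encountered each ID, so that we can increment the number in the file name. Sed also has a w flag that we can attach to a substitution command to write to a file much faster (instead of shelling out with e), but I'm not aware of any way to pass a variable argument to that flag.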
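To make the intermediate step visible, the sketch below saves the generated commands to a file before running them. It assumes GNU sed (the script relies on \+ and \|), reuses the script above with its comments stripped, and uses the made-up names sample.log and generated.sh.

```shell
# Two blocks with the same id, to exercise the per-id counter.
cat > sample.log <<'EOF'
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
it's the first block
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
second block
</xml>
EOF

# The Sed program from this answer, without the comments.
cat > parse_log.sed <<'EOF'
/<xml>/,/<\/xml>/ {
    /^<id>\([0-9]\+\)<\/id>$/ {
        s//c\1=$(( c\1 + 1 )); exec > "file\1.$c\1"/
        p
        n
    }
    /<xml>\|<\/xml>/ !{
        s/'/'"'"'/g
        s/^.*$/echo '&'/
        p
    }
}
EOF

# Step 1: generate the shell commands and keep them for inspection.
sed -nf parse_log.sed < sample.log > generated.sh
cat generated.sh
# c1=$(( c1 + 1 )); exec > "file1.$c1"
# echo 'it'"'"'s the first block'
# c1=$(( c1 + 1 )); exec > "file1.$c1"
# echo 'second block'

# Step 2: run them; this writes file1.1 and file1.2.
sh generated.sh
```

Each exec line redirects the standard output of the interpreting shell to a new file, so every echo that follows lands in that file until the next exec.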

Alternatively, we can pass the program to Sed as a command-line argument. Here is a condensed version that is easier to paste:

sed -n '/<xml>/,/<\/xml>/ {                             
    /^<id>\([0-9]\+\)<\/id>$/{s//c\1=$(( c\1 + 1 ));exec > "file\1.$c\1"/;p;n;}
    /<xml>\|<\/xml>/!{'"s/'/'\"'\"'/g;"'s/^.*$/echo '"'&'"'/;p;}                
}' < file.in | sh

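It works, but suffice it to say that Sed is not the best tool for this problem. Sed's simple language was not designed for this kind of logic, so the code is not pretty, and we rely on the shell to create the files, which adds some overhead. If you are dead set on using Sed, the job may take a bit longer. For anything performance-critical, consider one of the tools described in the other answers.

Based on the information and the example in the question, I assume that we do not want the opening and closing <xml> tags in the output and that the ID is always a number on a line of its own. The implementation writes to file names with a numeric extension that is incremented each time a duplicate ID is found (fileID.count: file1.1, file1.2, and so on). These details should be easy to change if needed.

Note: if you need them, the revision history contains two alternative implementations (one using GNU Sed, the other using a wrapper script), which I removed for brevity. They work, but they are unnecessarily slow or complicated.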

Answer 2

GNU Awk solution:

awk -v RS='<xml>|</xml>' '!(NR%2){ 
           gsub(/^[[:space:]]*|[[:space:]]*$/, ""); 
           printf "<xml>\n%s\n</xml>\n",$0 > "file"++c".xml";
           close("file"c".xml")
       }' file

Viewing the results:

$ head file*.xml
==> file1.xml <==
<xml>
<id>1</id> 
<!-- id is not unique since the xml data provides all the
information of an object X defined by its id at a specific point in time -->
some xml content on more than 500 lines
</xml>

==> file2.xml <==
<xml>
<id>2</id>
some xml content on more than 500 lines
</xml>

==> file3.xml <==
<xml>
<id>1</id>
some xml content on more than 500 lines
</xml>

Answer 3

perl one-liner

perl -ne 'if(s/.*(?=<xml>)//){$x++;open$fh,">file$x.xml"}if($fh){print$fh $_}if(/<\/xml>/){close$fh;undef$fh}' input.txt

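How it works

-n: similar to sed -n; reads the input (or the files given as arguments) line by line without printing each line automatically
s/.*(?=<xml>)//: removes everything to the left of <xml>, and evaluates to true if the pattern matched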

Answer 4

awk 'sub(/.*<xml>/,"<xml>") {out="file" ++i ".xml"; p=1}
     p {print > out}
     /<\/xml>/ {p=0; close(out)}
' file

If the log contains too many XML objects, you may get an error such as error: Too many open files, which is why I added the optional close(out) call.

Extract XML content from a log file using Sed and dump each result into a different file
