Question

I need to extract requests like the following from a log file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<vehicleRegistration>
.... XML in between ....
.... XML in between ....
.... XML in between ....
.... XML in between ....
... at nth line there is line like this <vehicle id="2312313"></vehicle>
.... XML in between ....
.... XML in between ....
</vehicleRegistration>

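The main problem is that a vehicle registration can be 5 lines long, sometimes 17; the length varies. That is where my current grep fails. I have used: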

grep -A 13 "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>" vehicle.log

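Another problem is that a request is sometimes sent two or more times, because the service may be unavailable for some reason, so the file can contain the same request several times.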

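I also need to exclude duplicate requests. Whether a request is a duplicate is decided by comparing the nth line (not the last line), <vehicle id="2312313"></vehicle>: if the vehicle ID repeats, the request is a duplicate.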

How would you approach this? Suggestions, code, pseudocode, anything is welcome.

Edit:

The log file is not an XML file; it is just a file that contains some XML requests among other content, so I cannot parse it as XML.

Edit II:

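Using only @eugene y's one-liner, perl -nle 'm{<vehicleRegistration>} .. m{</vehicleRegistration>} and print' logfile, I have extracted the vehicleRegistration parts. How do I get rid of the duplicates, i.e. the nodes with the same vehicle ID? I want to keep only one copy of each.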

Answer 1

I would use XML::Simple (or another XML parser) to extract the data. Data::Dumper can be used to inspect the resulting data structure.

Update: you can extract the vehicleRegistration elements like this:

open my $fh, '<', 'logfile' or die $!;
my $xml = "";

while (<$fh>) {
    # The range operator is true from the opening tag through the closing tag.
    if ( m{<vehicleRegistration>} .. m{</vehicleRegistration>} ) {
        $xml .= $_;
    }
}
close $fh;

Or use a Perl one-liner:

perl -nle 'm{<vehicleRegistration>} .. m{</vehicleRegistration>} and print' logfile

Answer 2

Use awk or gawk on Unix to extract the registrations...

#!/usr/bin/awk -f 

/^<vehicleRegistration>/    { printit = "true" }   # turn the print flag on
printit ~ "true"            { print }              # print while the flag is set
/^<\/vehicleRegistration>/  { printit = "false" }  # turn the print flag off
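
The same flag idea can be stretched to cover the duplicate requests the question asks about. What follows is only a sketch, not a tested solution for the real log: it buffers each block instead of printing it straight away, and prints it only if its vehicle id has not been seen before. The function name unique_registrations is made up for illustration, and the sketch assumes that the opening and closing tags each sit on their own line and that the id always appears as <vehicle id="...">.

```shell
#!/bin/sh
# Print each <vehicleRegistration> block once per vehicle id.
# Assumption: blocks are never nested and each tag is on its own line.
unique_registrations() {
    awk '
        # Opening tag: start a fresh buffer.
        /<vehicleRegistration>/ { inblock = 1; block = ""; id = "" }

        inblock {
            block = block $0 "\n"
            # Remember the id from the <vehicle id="..."> line.
            # 13 is the length of the prefix before the id value.
            if (match($0, /<vehicle id="[^"]*"/)) {
                id = substr($0, RSTART + 13, RLENGTH - 14)
            }
        }

        # Closing tag: print the buffer only for a new id.
        /<\/vehicleRegistration>/ {
            if (inblock && !(id in seen)) {
                seen[id] = 1
                printf "%s", block
            }
            inblock = 0
        }
    ' "$1"
}
```

It could be called as unique_registrations vehicle.log > unique.xml. The first block seen for each id is the one that is kept; blocks with no vehicle line at all share the empty id, so only the first of those would survive.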

Answer 3

Use XPath (and, depending on what you do with the results, possibly XSLT).

There are command-line utilities for this, here, for example.

Answer 4

Use XPath to retrieve the XML element nodes. Many modern scripting languages have frameworks for this.

With Perl, you might do something like this:

#!/usr/bin/perl

use strict;
use warnings;
use XML::XPath;

my $file = 'vehicleRegistration.xml';
my $xp = XML::XPath->new(filename => $file);

print "Vehicle id: ".$xp->find('//vehicle/@id')."\n";

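If necessary, parse the log file first to extract the XML document portions, then run XPath expressions on them to retrieve the elements and data you need.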
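
One possible way to do that extraction step, sketched in shell and awk rather than Perl: write each vehicleRegistration block to its own file, so that every file is a small well-formed XML document an XPath tool can open. The function name split_registrations and the registration_N.xml naming scheme are assumptions made for this example, and the sketch again assumes the opening and closing tags each sit on their own line.

```shell
#!/bin/sh
# Write each <vehicleRegistration> block from a log to its own file:
# registration_1.xml, registration_2.xml, ... inside the output directory.
# Usage: split_registrations LOGFILE OUTDIR
split_registrations() {
    log=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        # Opening tag: pick the next output file name.
        /<vehicleRegistration>/ {
            n++
            out = dir "/registration_" n ".xml"
            inblock = 1
        }

        # Copy every line of the block, tags included.
        inblock { print > out }

        # Closing tag: finish the file so few handles stay open.
        /<\/vehicleRegistration>/ {
            if (inblock) close(out)
            inblock = 0
        }
    ' "$log"
}
```

After split_registrations vehicle.log out, the script above could be pointed at out/registration_1.xml instead of vehicleRegistration.xml. The XML declaration line from the log is not copied, which is fine for parsing because the declaration is optional.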

Extracting certain patterns from a log
