Extract multi-line text from XML

Time: 2016-06-21 15:57:31

Tags: perl unix

I have an XML file, shown below. I want to extract the text between <com.eds.travel.fares.ping.response> and </com.eds.travel.fares.ping.response>. The XML starts with com.eds.travel.fares.ping.response and ends with com.eds.travel.fares.ping.response.

<?xml version="1.0" encoding="UTF-8"?>  
<!--This is a Ping Response-->
<com.eds.travel.fares.ping.response xmlns="http://schemas.eds.com/transportation/message/ping/response" targetNamespace="http://schemas.eds.com/transportation/message/ping/response" EchoToken="00c0d1a" TimeStamp="2016-06-21T00:01:48.191" Target="Test" Version="1.07" SequenceNmbr="1466467309030" PrimaryLangID="en" RequestorCompanyCode="1y" RequestorNetworkID="as" SetLocation="zrh">
 <Headers Trailers="n">
  <Result xmlns="http://schemas.eds.com/transportation/message/fares/common" status="success" />
 </Headers>
 <DataArea>
  <Pong Message="pong" ServerHostName="usclsefam922.clt.travel.eds.com" ServerPortNumber="8024" ServerMessageCount="1" RegionName="preprod" SystemName="preprods3.1" SystemDate="20160621" SystemTime="148" CodeVersion="$Name: build-2016-06-17-1338 $" />
 </DataArea>
 <Trailers />
</com.eds.travel.fares.ping.response> 

I tried the command below, but had no luck:

    cat file.txt | egrep "<com.eds.travel.fares.ping.response>.*</com.eds.travel.fares.ping.response>" 

Please advise.

2 answers:

Answer 0 (score: 0)

From what I have tried, egrep does not seem able to match across multiple lines. You can use pcregrep -M instead:

pcregrep -M 'com.eds.travel.fares.ping.response((.|\n)*)com.eds.travel.fares.ping.response'

did the trick for me.
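
As a rough illustration of why the egrep attempt in the question finds nothing: grep-style tools test one line at a time, and the opening tag in the sample also carries attributes, so the literal `<com.eds.travel.fares.ping.response>` never appears in the file. The sketch below uses a trimmed, hypothetical copy of the XML; the temporary file and the variable names are made up for the example.

```shell
# Hypothetical sample, trimmed from the XML in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!--This is a Ping Response-->
<com.eds.travel.fares.ping.response EchoToken="00c0d1a">
 <DataArea>
  <Pong Message="pong" />
 </DataArea>
</com.eds.travel.fares.ping.response>
EOF

# grep tests each line separately, so a pattern that needs the opening
# and the closing tag on the same line cannot match this file.
# grep -c exits non-zero when the count is 0, hence the "|| true".
single_line_hits=$(grep -cE '<com\.eds\.travel\.fares\.ping\.response.*</com\.eds\.travel\.fares\.ping\.response>' "$tmp" || true)

# Each tag is found on its own, but on two different lines.
tag_lines=$(grep -cE '</?com\.eds\.travel\.fares\.ping\.response' "$tmp")

echo "lines holding both tags: $single_line_hits"
echo "lines holding one tag:   $tag_lines"

rm -f "$tmp"
```

With this sample the first count is 0 and the second is 2, which is the gap that pcregrep -M closes by letting a pattern span newlines.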

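Answer 1 (score: 0)

Rule one of XML: do not use regular expressions. XML is a contextual language, and regular expressions are not. You will end up with a brittle hack that will one day mysteriously break when the XML changes in a perfectly valid way.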
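
To make the brittleness concrete, here is a small sketch. The `compact`, `wrapped` and `pattern` names, and the pattern itself, are invented for illustration. The same `Pong` element written on one line and written with its attributes wrapped is identical XML, yet a line-based pattern only sees the first form.

```shell
# The same element twice: once on a single line, once with wrapped attributes.
compact='<Pong Message="pong" RegionName="preprod" />'
wrapped='<Pong
    Message="pong"
    RegionName="preprod"
/>'

# Hypothetical pattern that assumes the whole element sits on one line.
pattern='<Pong .*RegionName="preprod"'

# grep -c exits non-zero when the count is 0, hence the "|| true".
compact_hits=$(printf '%s\n' "$compact" | grep -cE "$pattern" || true)
wrapped_hits=$(printf '%s\n' "$wrapped" | grep -cE "$pattern" || true)

echo "compact form matches: $compact_hits"
echo "wrapped form matches: $wrapped_hits"
```

The compact form yields one match and the wrapped form none, although an XML parser treats both as the same element.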

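Instead, use a parser. Perl has several options. I happen to like XML::Twig as a good starting point (XML::LibXML is also excellent, but has a steeper learning curve).

For this, all you need is: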

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig -> new ( comments => 'drop' )->parse ( \*DATA ); 
$twig -> set_pretty_print('indented_a');
$twig -> get_xpath('//com.eds.travel.fares.ping.response',0  ) -> print;


__DATA__
<?xml version="1.0" encoding="UTF-8"?>  
<!--This is a Ping Response-->
<com.eds.travel.fares.ping.response xmlns="http://schemas.eds.com/transportation/message/ping/response" targetNamespace="http://schemas.eds.com/transportation/message/ping/response" EchoToken="00c0d1a" TimeStamp="2016-06-21T00:01:48.191" Target="Test" Version="1.07" SequenceNmbr="1466467309030" PrimaryLangID="en" RequestorCompanyCode="1y" RequestorNetworkID="as" SetLocation="zrh">
 <Headers Trailers="n">
  <Result xmlns="http://schemas.eds.com/transportation/message/fares/common" status="success" />
 </Headers>
 <DataArea>
  <Pong Message="pong" ServerHostName="usclsefam922.clt.travel.eds.com" ServerPortNumber="8024" ServerMessageCount="1" RegionName="preprod" SystemName="preprods3.1" SystemDate="20160621" SystemTime="148" CodeVersion="$Name: build-2016-06-17-1338 $" />
 </DataArea>
 <Trailers />
</com.eds.travel.fares.ping.response> 

This outputs, as requested:

<com.eds.travel.fares.ping.response
    EchoToken="00c0d1a"
    PrimaryLangID="en"
    RequestorCompanyCode="1y"
    RequestorNetworkID="as"
    SequenceNmbr="1466467309030"
    SetLocation="zrh"
    Target="Test"
    TimeStamp="2016-06-21T00:01:48.191"
    Version="1.07"
    targetNamespace="http://schemas.eds.com/transportation/message/ping/response"
    xmlns="http://schemas.eds.com/transportation/message/ping/response">
  <Headers Trailers="n">
    <Result
        status="success"
        xmlns="http://schemas.eds.com/transportation/message/fares/common"
    />
  </Headers>
  <DataArea>
    <Pong
        CodeVersion="$Name: build-2016-06-17-1338 $"
        Message="pong"
        RegionName="preprod"
        ServerHostName="usclsefam922.clt.travel.eds.com"
        ServerMessageCount="1"
        ServerPortNumber="8024"
        SystemDate="20160621"
        SystemName="preprods3.1"
        SystemTime="148"
    />
  </DataArea>
  <Trailers/>
</com.eds.travel.fares.ping.response>

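That is basically all of the XML, minus the header and the comment. Technically it is what you asked for, but it is a bit trivial. Note the reformatting, though: reformatting XML is perfectly valid, and that is exactly why regex-based solutions break.

So how about: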

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig -> new -> parsefile ( 'file.txt' ); 

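# Visit every <Pong> element and print each attribute as "name => value".
foreach my $pong ( $twig -> get_xpath('//Pong' ) ) { 
    # keys returns hash keys in no particular order; use "sort keys" for stable output.
    foreach my $key( keys %{$pong -> atts} ) { 
        print "$key => ", $pong -> att($key),"\n";
    }
}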

With your source data, this prints:

CodeVersion => $Name: build-2016-06-17-1338 $
RegionName => preprod
SystemTime => 148
ServerHostName => usclsefam922.clt.travel.eds.com
SystemDate => 20160621
SystemName => preprods3.1
ServerMessageCount => 1
ServerPortNumber => 8024
Message => pong