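Perl: read a line and the next line

Time: 2015-04-30 08:14:56

Tags: regex xml perl parsing

I need to parse an XML file. I need to extract the time codes (start and end) and the sentence associated with each time span.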

The XML file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="jj" audio_filename="01" version="1" version_date="150211">
 <Episode>
  <Section type="report" startTime="0" endTime="50.28281021118164">
   <Turn startTime="0" endTime="50.28281021118164">
    <Sync time="0"/>

    <Sync time="1.195"/>
    Something
    <Sync time="2.654"/>
    Something 2
    <Sync time="4.356"/>
    Something 3
    <Sync time="9.321"/>
    Something 4
    <Sync time="22.171"/>
    Something 5
    <Sync time="28.351"/>
    Something 6
    <Sync time="35.708"/>
    Something 7
    <Sync time="43.04"/>
    Something 8
   </Turn>
  </Section>
</Episode>

I tried this in Perl, but it does not work well:

#!/usr/bin/perl -nw
next if ! /<Sync/;
$stime = "";
$sentence = "";
$etime = "";

$stime = $1 if (/Sync time="([0-9]+\.[0-9]*)"/);
$sentence = <>;
chomp($sentence);

if ($stime eq ''){ $stime = 0;}

print "$stime  $sentence\n";
__END__

The output format I want is:

0  1.195
1.195 2.654 Something
2.654 4.356 Something 2
4.356 9.321 Something 3
9.321 22.171 Something 4
22.171 28.351 Something 5
28.351 35.708 Something 6
35.708 43.04 Something 7
43.04 endTime Something 8

Many thanks.

2 answers:

Answer 0 (score: 2)

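First of all: line-oriented parsing of XML is very bad juju. XML is a data format in which structure matters, not line breaks, so the same document can be reformatted in perfectly valid ways that will break a line-based parser.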
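To make that fragility concrete, here is a minimal sketch using core Perl only. `naive_pairs` is a hypothetical helper written for this illustration, not part of any module. Like the script in the question, it assumes that every Sync tag sits on its own line with its sentence on the next line. The two inputs hold the same data and differ only in whitespace.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: a naive line-oriented extractor. It pairs each line
# containing a <Sync .../> tag with whatever is on the following line.
sub naive_pairs {
    my ($xml) = @_;
    my @lines = split /\n/, $xml;
    my @pairs;
    for my $i ( 0 .. $#lines ) {
        next unless $lines[$i] =~ /<Sync time="([^"]*)"/;
        my $time = $1;    # save the capture before running another regex
        my $text = $i < $#lines ? $lines[ $i + 1 ] : '';
        $text =~ s/^\s+|\s+$//g;
        push @pairs, "$time|$text";
    }
    return @pairs;
}

# The same data, serialised one item per line and then on a single line.
my $one_per_line = qq{<Turn>\n<Sync time="1.195"/>\nSomething\n<Sync time="2.654"/>\nSomething 2\n</Turn>};
my $single_line  = qq{<Turn><Sync time="1.195"/>Something<Sync time="2.654"/>Something 2</Turn>};

print join( ', ', naive_pairs($one_per_line) ), "\n";
print join( ', ', naive_pairs($single_line) ),  "\n";
```

The first call finds both time/sentence pairs. The second finds only the first Sync tag and loses its sentence, even though both inputs are equally valid XML.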

So, for starters:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig -> new -> parsefile ( 'sample.xml' );

my $previous_sync = 0; 
foreach my $sync ( $twig -> get_xpath('Episode/Section/Turn/Sync') ) {
   my $sync_time =  $sync -> att('time');
   print "$previous_sync $sync_time ", $sync->text,"\n";
   $previous_sync = $sync_time;
}
print "$previous_sync ", $twig -> get_xpath('Episode/Section',0) -> att('endTime'),"\n";

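Now, I hit a slight problem, because your 'Somethings' are not actually associated with the corresponding 'Sync' elements. They are text content of the parent Turn. (The Sync elements are empty, self-closing tags.)

But hopefully this illustrates a better way to parse XML?

Edit: updated to handle the text as it stands. Note: I had to modify your XML to include </Trans> as the last line, like so: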

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="jj" audio_filename="01" version="1" version_date="150211">
 <Episode>
  <Section type="report" startTime="0" endTime="50.28281021118164">
   <Turn startTime="0" endTime="50.28281021118164">
    <Sync time="0"/>

    <Sync time="1.195"/>
    Something
    <Sync time="2.654"/>
    Something 2
    <Sync time="4.356"/>
    Something 3
    <Sync time="9.321"/>
    Something 4
    <Sync time="22.171"/>
    Something 5
    <Sync time="28.351"/>
    Something 6
    <Sync time="35.708"/>
    Something 7
    <Sync time="43.04"/>
    Something 8
   </Turn>
  </Section>
</Episode>
</Trans>

So, if that still looks right, and you are not actually trying to work with broken XML, the following gives the desired output.

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

my $previous_sync;

sub handle_sync {
    my ( $twig, $sync ) = @_;
    my $sync_time = $sync->att('time');
    if ( not defined $previous_sync ) {
        $previous_sync = $sync_time;
        return;
    }
    print "$previous_sync $sync_time ";
    $previous_sync = $sync_time;
    my (@sync_text) = split( "\n", $sync->parent->text );
    pop(@sync_text);    #discard blank line.
    my $line = pop(@sync_text);
    if ( defined $line ) {
        $line =~ s/^\s+//g;
        print $line;
    }
    print "\n";
}

my $twig = XML::Twig->new( twig_handlers => { 'Sync' => \&handle_sync } )
    ->parsefile('sample.xml');
print "$previous_sync ",
    $twig->get_xpath( 'Episode/Section', 0 )->att('endTime'), " ";

my @sync_text =
    split( "\n", $twig->get_xpath( 'Episode/Section/Turn', 0 )->text );
my $line = $sync_text[-2];
$line =~ s/^\s+//g;
print $line, "\n";

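This is a bit of a fudge, because the 'text' there belongs to the Turn element, so I have taken a 'print the last (complete) line' approach. That seems to work here, but it may not if a sentence spans multiple lines.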
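One possible way around the multi-line limitation, sketched here rather than taken from the original answer, is to walk the children of each Turn in document order and collect the text that follows each Sync until the next Sync appears. `sync_intervals` and `squash` are hypothetical helper names, and the inline sample is made up for the demonstration. The sketch assumes the sentences are plain text directly inside Turn and that the final interval ends at the Turn's `endTime` attribute.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Collapse runs of whitespace (including newlines) and trim both ends.
sub squash {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Hypothetical helper: returns one "start end text" string per Sync interval.
sub sync_intervals {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    my @lines;
    for my $turn ( $twig->root->descendants('Turn') ) {
        my ( $start, $text );
        # Close the current interval at the given end time.
        my $emit = sub {
            my ($end) = @_;
            my $clean = squash($text);
            push @lines, length($clean) ? "$start $end $clean" : "$start $end";
        };
        for my $child ( $turn->children ) {
            if ( $child->tag eq 'Sync' ) {
                $emit->( $child->att('time') ) if defined $start;
                ( $start, $text ) = ( $child->att('time'), '' );
            }
            elsif ( $child->is_text and defined $start ) {
                $text .= $child->text;    # text belonging to the open interval
            }
        }
        # Assumption: the last interval ends at the Turn's endTime.
        $emit->( $turn->att('endTime') ) if defined $start;
    }
    return @lines;
}

# Made-up sample with one sentence spanning two lines.
my $sample = <<'XML';
<Trans>
 <Episode>
  <Section type="report" startTime="0" endTime="4.356">
   <Turn startTime="0" endTime="4.356">
    <Sync time="0"/>

    <Sync time="1.195"/>
    Something
    <Sync time="2.654"/>
    Something 2,
    continued on a second line
   </Turn>
  </Section>
 </Episode>
</Trans>
XML

print "$_\n" for sync_intervals($sample);
```

Because the text is accumulated per interval instead of being split on newlines, a sentence that wraps over several lines stays attached to the Sync that precedes it.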

Answer 1 (score: 0)

Using XML::XSH2, a wrapper around XML::LibXML:

open sample.xml ;
for //Sync
    echo @time normalize-space(following-sibling::node()[1][self::text()]) ;
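For readers who would rather stay in plain Perl, the same XPath expression can probably be used through XML::LibXML directly. The following is a sketch under stated assumptions, not part of the original answer. `sync_text_pairs` is a hypothetical helper name, and looking up the end time (the next Sync's `time`, or the parent's `endTime` for the last one) is an addition that the XSH2 snippet does not perform. The inline sample omits the DOCTYPE so that no external DTD needs to be available.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns [start, end, text] for every Sync element.
sub sync_text_pairs {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    my @rows;
    for my $sync ( $doc->findnodes('//Sync') ) {
        my $start = $sync->getAttribute('time');
        # Text node immediately following this Sync, whitespace-normalised;
        # empty if the next sibling is not a text node.
        my $text = $sync->findvalue(
            'normalize-space(following-sibling::node()[1][self::text()])');
        # Addition for illustration: end time from the next Sync, falling
        # back to the parent's endTime attribute for the last Sync.
        my $end = $sync->findvalue('following-sibling::Sync[1]/@time');
        $end = $sync->findvalue('../@endTime') if $end eq '';
        push @rows, [ $start, $end, $text ];
    }
    return @rows;
}

# Made-up sample for the demonstration.
my $sample = <<'XML';
<Trans>
 <Turn startTime="0" endTime="4.356">
  <Sync time="0"/>
  <Sync time="1.195"/>
  Something
  <Sync time="2.654"/>
  Something 2
 </Turn>
</Trans>
XML

printf "%s %s %s\n", @$_ for sync_text_pairs($sample);
```

Each row pairs a Sync with the text node that directly follows it, which is what the `following-sibling::node()[1][self::text()]` step selects.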