Remove lines from an XML file if they contain the same word (Perl)

Time: 2012-01-13 15:55:31

Tags: perl

I have a file "frequencies.xml" that contains lines in the following format:

<?xml version="1.0"?>
<!DOCTYPE stationlist PUBLIC "-//xxxxx//DTD stationlist 1.0//EN"   "http://xxxxxxxxx/DTD/xxxxxxxx.dtd">
<frequencies xmlns="http://xxxxxxxxxxxxxxxx/DTD/">
 <list norm="PAL" frequencies="Custom" audio="bg">
..............................................................
<station name="A" active="1" channel="48.25MHz" norm="PAL"/>
<station name="B" active="1" channel="55.25MHz" norm="PAL"/>
<station name="C" active="1" channel="62.25MHz" norm="PAL"/>
<station name="D" active="1" channel="112.25MHz" norm="PAL"/>
..............................................................
<station name="E" active="1" channel="119.25MHz" norm="PAL"/>
<station name="F" active="0" channel="48.25MHz" norm="PAL"/>
..............................................................
<station name="G" active="1" channel="55.25MHz" norm="PAL"/>
<station name="H" active="0" channel="62.25MHz" norm="PAL"/>
..............................................................
  </list>
 </frequencies>

I want to delete a line, treating it as a duplicate, if it contains the same frequency as an earlier line.

Expected output:

<station name="A" active="1" channel="48.25MHz" norm="PAL"/>
<station name="B" active="1" channel="55.25MHz" norm="PAL"/>
<station name="C" active="1" channel="62.25MHz" norm="PAL"/>
<station name="D" active="1" channel="112.25MHz" norm="PAL"/>
<station name="E" active="1" channel="119.25MHz" norm="PAL"/>

I wrote this script to do it:

for i in `cat frequencies.xml | sed 's/.*channel="\([^"]*\)".*/\1/; /</ d' |grep MHz`; do
cat frequencies.xml | awk -v i="channel=\"$i" '
    BEGIN       { a=0 }
    $0 ~ i      { if ( a == "1" ) { print i"\" - duplicate" > "/dev/stderr"  ; next ;} ; a=1 } 
            { print $_ }' > frequencies.xml.tmp && \
mv frequencies.xml.tmp frequencies.xml
done

How can I convert this to Perl?

Thanks.

Update: I want to preserve the XML structure.

My code:

open (FH, "+< frequencies.xml") or die "Opening: $!";
my $out = '';
my %seen = ();
foreach my $line ( <FH> ) {
   if ( $line =~ m/<station/ ) {
        my ( $freq ) = ( $line =~ m/channel="([^"]+)"/ );
            $out .= $line unless $seen{$freq}++;
    } else {
        $out .= $line;
    }
}
seek(FH,0,0)                    or die "Seeking: $!";
print FH $out                   or die "Printing: $!";
truncate(FH, tell(FH))          or die "Truncating: $!";
close(FH)                       or die "Closing: $!";

5 Answers:

Answer 0: (score: 3)

Keep a hash to track the frequencies you have seen, and do not emit a line if you have already seen its frequency:

open INPUT, '<', 'frequencies.xml' or die "Can't read file : $!";
my %seen = ();
foreach my $line ( <INPUT> ) {
   my ( $freq ) = ( $line =~ m/channel="([^"]+)"/ );
   print $line unless $seen{$freq};
   $seen{$freq}++;
}
close INPUT;

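Update

If you want to keep the other lines, you just need to print them as well. The easiest way is to test whether a line is a <station> element and print everything else unchanged. Once things get more complex than this, though, you will probably want to use a real XML parser. So, using Zaid's suggestion: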

open INPUT, '<', 'frequencies.xml' or die "Can't read file : $!";
my %seen = ();
foreach my $line ( <INPUT> ) {
   if ( $line =~ m/<station/ ) {
      my ( $freq ) = ( $line =~ m/channel="([^"]+)"/ );
      print $line unless $seen{$freq}++;
   } else {
      print $line;
   }
}
close INPUT;
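
The loop above prints to standard output. As a minimal sketch, assuming you want the filter as a reusable function, the same seen-hash logic can be wrapped in a sub that takes lines and returns the ones to keep. The name `dedupe_stations` and the shortened sample data are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns the input
# lines, dropping every <station> line whose channel was already seen.
sub dedupe_stations {
    my @lines = @_;
    my %seen;
    my @kept;
    for my $line (@lines) {
        # Only <station> lines that carry a channel attribute are candidates
        if ( $line =~ /<station\b/ and $line =~ /channel="([^"]+)"/ ) {
            next if $seen{$1}++;    # duplicate frequency: skip this line
        }
        push @kept, $line;          # first occurrence, or a non-station line
    }
    return @kept;
}

# Sample data, shortened from the question
my @input = (
    qq{<list norm="PAL" frequencies="Custom" audio="bg">\n},
    qq{<station name="A" active="1" channel="48.25MHz" norm="PAL"/>\n},
    qq{<station name="B" active="1" channel="55.25MHz" norm="PAL"/>\n},
    qq{<station name="F" active="0" channel="48.25MHz" norm="PAL"/>\n},
    qq{</list>\n},
);

my @output = dedupe_stations(@input);
print @output;
```

To rewrite the file itself, one option is to read its lines, pass them through the function, write the result to a temporary file, and rename that file over the original, as the shell version in the question does.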

Answer 1: (score: 0)

One way, using a one-liner:

perl -ne '($freq) = m/(?i)channel="([^"]+)/; print unless exists $arr{ $freq }; $arr{ $freq } = 1' infile

Answer 2: (score: 0)

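open(IN, '<', 'frequencies.xml') or die "Can't read file: $!";
while ($inline = <IN>) {
  # Capture the numeric part of the frequency; undef if the line has none
  ($freq) = $inline =~ /([\d.]+)MHz/;
  # Quote the value so "." is literal and 48.25 cannot match 148.25
  push(@out, $inline) unless defined $freq && grep(/"\Q$freq\EMHz"/, @out);
}
close(IN);
print @out;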

Answer 3: (score: 0)

$ perl -i.tmp -ane 'print unless /<station/ && $seen{ $F[3] }++' frequencies.xml

Answer 4: (score: 0)

Using XML::XSH2:

use XML::XSH2;
xsh q{
    open so-8853324.xml;
    $ch := hash @channel //station;
    for { keys %$ch } ls xsh:lookup("ch", .)[1];
};

I removed the namespace from the data to simplify the code.
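
For comparison, here is a minimal sketch of the same de-duplication with XML::LibXML, assuming that module is installed (it is not part of core Perl). Instead of removing the namespace, it registers a prefix for it. The prefix `f` and the URI `http://example.com/DTD/` are placeholders, since the real namespace URI is masked in the question, and the sample document is shortened.

```perl
use strict;
use warnings;
use XML::LibXML;

# Placeholder namespace URI: the real one is masked in the question.
my $ns = 'http://example.com/DTD/';

my $xml = <<"XML";
<?xml version="1.0"?>
<frequencies xmlns="$ns">
 <list norm="PAL" frequencies="Custom" audio="bg">
  <station name="A" active="1" channel="48.25MHz" norm="PAL"/>
  <station name="B" active="1" channel="55.25MHz" norm="PAL"/>
  <station name="F" active="0" channel="48.25MHz" norm="PAL"/>
 </list>
</frequencies>
XML

# Do not fetch any external DTD while parsing
my $doc = XML::LibXML->load_xml( string => $xml, load_ext_dtd => 0 );

# XPath needs a prefix to address elements in a default namespace
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( f => $ns );

my %seen;
for my $station ( $xpc->findnodes('//f:station') ) {
    my $channel = $station->getAttribute('channel');
    next unless defined $channel;
    # Detach the element if this channel has been seen before
    $station->unbindNode if $seen{$channel}++;
}

print $doc->toString;
```

Because the document is parsed rather than matched line by line, this version does not depend on each station sitting on its own line or on the order of its attributes.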