Simplifying a Perl script that uses regular expressions

Date: 2016-07-06 12:59:03

Tags: regex xml perl

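I have to extract some information from an XML file based on a pattern. I did manage to write a working script, but I'm pretty sure it could be simpler and/or cleaner.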

Could you tell me what would be better, and why?

What my input looks like:

<modifs>
  <modif id="14661"><code c="1" /><extra id="109816" /><avant num_words="1">démissionné</avant><apres num_words="1">démissionner</apres></modif>
  <modif id="125247"><code c="1" /><avant num_words="1">demis-tons</avant><apres num_words="1">demi-tons</apres></modif>
  <modif id="90891"><code c="1" /><avant num_words="1">démit</avant><apres num_words="1">démis</apres></modif>
  <modif id="198379"><code c="1" /><avant num_words="1">demi-terain</avant><apres num_words="1">demi-terrain</apres></modif>
  <modif id="172795"><code c="1" /><avant num_words="1">demi-ton</avant><apres num_words="1">demi-tons</apres></modif>
</modifs>

What I want:

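When the content of the avant and apres tags ends in -er (or -é, which the script's patterns also accept), display each id and extra id, followed by the content of avant and apres.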

So it would look like this:

id="14661"
extra id="109816"
démissionné |||| démissionner

What my script looks like:

 use strict;
    use warnings;


    my $fichier = 'path';
    my $fichiersortie = "path";
    my @lignes ;
    my @tableau_avant ;
    my @tableau_apres ;
    my @ids ;
    my @extraids ; 
    my @radical_avant ;
    my @radical_apres ;

    open (OUTPUT, ">$fichiersortie");
    binmode(OUTPUT, ":utf8");
    open(my $fh, '<:encoding(UTF-8)', $fichier)
      or die "Can't open file";

    while (my $row = <$fh>) {
        chomp $row;

        @radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
        @radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/ ;
        @tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
        @tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/ ;
        @ids = $row =~ /<modif (id="\d+")>/ ;
        @extraids = $row =~ /<(extra id="\d+")\s\/>/g ;


        foreach my $id (@ids) {
        foreach my $match_avant (@tableau_avant) {
        foreach my $match_apres (@tableau_apres) {

        foreach my $radical_avant (@radical_avant){
        foreach my $radical_apres (@radical_apres){
        if ($radical_avant eq $radical_apres) {

        print OUTPUT "$id\n";
foreach my $extraid (@extraids) {
        print OUTPUT "$extraid\n";}
        print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n" ;}
        }
        }
                }
        }
        }
}
close (OUTPUT);


Tidied up, the Perl code looks like this:

use strict;
use warnings;

my $fichier       = 'path';
my $fichiersortie = "path";
my @lignes;
my @tableau_avant;
my @tableau_apres;
my @ids;
my @extraids;
my @radical_avant;
my @radical_apres;

open( OUTPUT, ">$fichiersortie" );
binmode( OUTPUT, ":utf8" );

open( my $fh, '<:encoding(UTF-8)', $fichier ) or die "Can't open file";

while ( my $row = <$fh> ) {
    chomp $row;

    @radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
    @radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/;
    @tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
    @tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/;
    @ids           = $row =~ /<modif (id="\d+")>/;
    @extraids      = $row =~ /<(extra id="\d+")\s\/>/g;

    foreach my $id (@ids) {

        foreach my $match_avant (@tableau_avant) {

            foreach my $match_apres (@tableau_apres) {

                foreach my $radical_avant (@radical_avant) {

                    foreach my $radical_apres (@radical_apres) {

                        if ( $radical_avant eq $radical_apres ) {

                            print OUTPUT "$id\n";

                            foreach my $extraid (@extraids) {
                                print OUTPUT "$extraid\n";
                            }

                            print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n";
                        }
                    }
                }
            }
        }
    }
}

close(OUTPUT);

1 answer:

Answer 0 (score: 3)

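Don't use regular expressions to parse XML. It leads to brittle code, because the same XML can be written in many equivalent ways (different quoting, whitespace, attribute order, line breaks) that a pattern written for one layout will not match.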
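To illustrate the brittleness, here is a small sketch that runs the id pattern from the question against three spellings of the same element. The sample strings and the helper name find_id are made up for this illustration; only the pattern itself comes from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the id pattern from the question to one line.
sub find_id {
    my ($line) = @_;
    my ($id) = $line =~ /<modif (id="\d+")>/;
    return $id;    # undef when the pattern does not match
}

# Three spellings that an XML parser treats as the same element.
my @variants = (
    '<modif id="14661">',      # as in the sample input
    "<modif id='14661'>",      # single quotes are equally valid XML
    '<modif  id="14661" >',    # extra whitespace is also valid XML
);

for my $line (@variants) {
    my $id = find_id($line);
    print defined $id ? "matched: $id\n" : "no match: $line\n";
}
```

Only the first spelling matches the pattern, whereas an XML parser reads all three as the same element with the same id attribute.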

Perl has XML parsers available. With XML::Twig, it looks something like this:

#!/usr/bin/env perl
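use strict;
use warnings;
use utf8;    #the patterns below contain a literal 'é'
use XML::Twig;

#print the accented characters as UTF-8
binmode STDOUT, ':encoding(UTF-8)';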

#load the file into the XML parser, as $twig
my $twig = XML::Twig->new->parsefile('your_file.xml'); 

#iterate elements '<modif>' - anywhere in data structure. 
#(That's what // prefix means in xpath)
foreach my $modif ( $twig->get_xpath('//modif') ) {

   #For each modif element, extract the contents of 'avant' and 'apres' and
   #compare them. 
   if (   $modif->first_child_text('avant') =~ m/(er|é)$/
      and $modif->first_child_text('apres') =~ m/(er|é)$/ )
   {
      #from this element, get the 'id' attribute. 
      #<modif id="???">
      print "ID: ", $modif->att('id'), "\n";
      #fetch all the children of <modif> called '<extra>' 
      #use map to fetch the 'id' attributes of all of these. (if more than one)
      print "extra ids: ", join( " ", map { $_->att('id') } $modif->children('extra') ), "\n";
      #fetch content of '<avant>' and '<apres>' nodes. 
      print $modif->first_child_text('avant'), " |||| ", $modif->first_child_text('apres'), "\n";
   }
}

I'm not entirely clear on the comparison you're making, but hopefully this is enough to illustrate how to do it?

E.g. perhaps you're comparing:

#compare both avant and apres, but only after trimming
#a couple of letters off the end. 
if ( $modif -> first_child_text('avant') =~ s/(er|é)$//r 
  eq $modif -> first_child_text('apres') =~ s/(er|é)$//r  ) {

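This compares the two after dropping those trailing characters. (Note: the r regex modifier needs a reasonably recent Perl, 5.14 or later; on older versions it will fail with an error.)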
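On a Perl older than 5.14, one possible way around the missing r modifier is to copy the text into a variable and run the substitution on the copy. The sketch below assumes that approach. The helper names stem and same_stem are invented for this illustration, and the words are taken from the sample input.

```perl
use strict;
use warnings;
use utf8;    # this source contains a literal 'é'

# Hypothetical helper: returns the word without a trailing "er" or "é".
# It substitutes on a copy, so no /r modifier is needed.
sub stem {
    my ($word) = @_;
    ( my $copy = $word ) =~ s/(?:er|é)$//;
    return $copy;
}

# Hypothetical helper: true when both words end in "er" or "é"
# and are identical once that ending is removed.
sub same_stem {
    my ( $avant, $apres ) = @_;
    return 0 unless $avant =~ /(?:er|é)$/ and $apres =~ /(?:er|é)$/;
    return stem($avant) eq stem($apres) ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $pair ( [ 'démissionné', 'démissionner' ], [ 'démit', 'démis' ] ) {
    my ( $avant, $apres ) = @$pair;
    print "$avant |||| $apres\n" if same_stem( $avant, $apres );
}
```

Inside the XML::Twig loop, the two arguments would be the first_child_text values of avant and apres.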