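I have to extract some information from an XML file according to a pattern. I have a working script, but I'm fairly sure it could be simpler and/or cleaner.
Can you tell me what would be better, and why?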
<modifs>
<modif id="14661"><code c="1" /><extra id="109816" /><avant num_words="1">démissionné</avant><apres num_words="1">démissionner</apres></modif>
<modif id="125247"><code c="1" /><avant num_words="1">demis-tons</avant><apres num_words="1">demi-tons</apres></modif>
<modif id="90891"><code c="1" /><avant num_words="1">démit</avant><apres num_words="1">démis</apres></modif>
<modif id="198379"><code c="1" /><avant num_words="1">demi-terain</avant><apres num_words="1">demi-terrain</apres></modif>
<modif id="172795"><code c="1" /><avant num_words="1">demi-ton</avant><apres num_words="1">demi-tons</apres></modif>
</modifs>
When the contents of both the avant and apres tags end in -é or -er, display each id and extra id, followed by the contents of avant and apres.
So it should look like this:
id="14661"
extra id="109816"
démissionné |||| démissionner
use strict;
use warnings;
my $fichier = 'path';
my $fichiersortie = "path";
my @lignes ;
my @tableau_avant ;
my @tableau_apres ;
my @ids ;
my @extraids ;
my @radical_avant ;
my @radical_apres ;
open (OUTPUT, ">$fichiersortie");
binmode(OUTPUT, ":utf8");
open(my $fh, '<:encoding(UTF-8)', $fichier)
or die "Can't open file";
while (my $row = <$fh>) {
chomp $row;
@radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
@radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/ ;
@tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
@tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/ ;
@ids = $row =~ /<modif (id="\d+")>/ ;
@extraids = $row =~ /<(extra id="\d+")\s\/>/g ;
foreach my $id (@ids) {
foreach my $match_avant (@tableau_avant) {
foreach my $match_apres (@tableau_apres) {
foreach my $radical_avant (@radical_avant){
foreach my $radical_apres (@radical_apres){
if ($radical_avant eq $radical_apres) {
print OUTPUT "$id\n";
foreach my $extraid (@extraids) {
print OUTPUT "$extraid\n";}
print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n" ;}
}
}
}
}
}
}
close (OUTPUT);
Tidied up, the Perl code looks like this:
use strict;
use warnings;
my $fichier       = 'path';
my $fichiersortie = "path";
my @lignes;
my @tableau_avant;
my @tableau_apres;
my @ids;
my @extraids;
my @radical_avant;
my @radical_apres;
open( OUTPUT, ">$fichiersortie" );
binmode( OUTPUT, ":utf8" );
open( my $fh, '<:encoding(UTF-8)', $fichier ) or die "Can't open file";

while ( my $row = <$fh> ) {
    chomp $row;
    @radical_avant = $row =~ /<avant.+?>(.+?)(?:er|é)<\/avant>/;
    @radical_apres = $row =~ /<apres.+?>(.+?)(?:er|é)<\/apres>/;
    @tableau_avant = $row =~ /<avant.+?>(.+?(?:er|é))<\/avant>/;
    @tableau_apres = $row =~ /<apres.+?>(.+?(?:er|é))<\/apres>/;
    @ids           = $row =~ /<modif (id="\d+")>/;
    @extraids      = $row =~ /<(extra id="\d+")\s\/>/g;
    foreach my $id (@ids) {
        foreach my $match_avant (@tableau_avant) {
            foreach my $match_apres (@tableau_apres) {
                foreach my $radical_avant (@radical_avant) {
                    foreach my $radical_apres (@radical_apres) {
                        if ( $radical_avant eq $radical_apres ) {
                            print OUTPUT "$id\n";
                            foreach my $extraid (@extraids) {
                                print OUTPUT "$extraid\n";
                            }
                            print OUTPUT "$match_avant" . " |||| " . "$match_apres\n\n";
                        }
                    }
                }
            }
        }
    }
}
close(OUTPUT);
Answer 0 (score: 3)
Don't use regular expressions to parse XML. It leads to brittle code.
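To illustrate what "brittle" means here, the following is a minimal sketch that applies the question's own id regex to two spellings of the same element. The `$same_data` string and the `regex_finds_id` helper are made up for this illustration and are not part of the question. An XML parser would treat both strings identically; the regex recognises only the first.

```perl
use strict;
use warnings;

# The id pattern from the question's script.
my $id_re = qr/<modif (id="\d+")>/;

# The original spelling, and a hypothetical but equivalent variant:
# single quotes and extra whitespace are both legal XML.
my $original  = '<modif id="14661"><code c="1" /></modif>';
my $same_data = "<modif  id='14661' ><code c='1'/></modif>";

# Returns 1 if the regex finds an id in the given line, 0 otherwise.
sub regex_finds_id {
    my ($line) = @_;
    return $line =~ $id_re ? 1 : 0;
}

print "original:  ", regex_finds_id($original),  "\n";
print "same data: ", regex_finds_id($same_data), "\n";
```

Run as written, this reports 1 for the original line and 0 for the variant, even though both carry the same data.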
Perl has XML parsers. With XML::Twig, it looks like this:
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # the patterns below contain a literal "é"; assumes this file is saved as UTF-8
use XML::Twig;

# Print the accented text as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Load the file into the XML parser, as $twig.
my $twig = XML::Twig->new->parsefile('your_file.xml');

# Iterate over <modif> elements anywhere in the data structure
# (that's what the // prefix means in XPath).
foreach my $modif ( $twig->get_xpath('//modif') ) {

    # For each <modif> element, check that the contents of both
    # <avant> and <apres> end in "er" or "é".
    if (    $modif->first_child_text('avant') =~ m/(er|é)$/
        and $modif->first_child_text('apres') =~ m/(er|é)$/ )
    {
        # From this element, get the 'id' attribute: <modif id="???">
        print "ID: ", $modif->att('id'), "\n";

        # Fetch all the children of <modif> called <extra> and use map
        # to collect their 'id' attributes (there may be more than one).
        print "extra ids: ", join( " ", map { $_->att('id') } $modif->children('extra') ), "\n";

        # Fetch the content of the <avant> and <apres> nodes.
        print $modif->first_child_text('avant'), " |||| ", $modif->first_child_text('apres'), "\n";
    }
}
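I'm not entirely clear on the comparison you're making, but hopefully this is enough to show how to approach it.
E.g., perhaps you are comparing: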
#compare both avant and apres, but only after trimming
#a couple of letters off the end.
if ( $modif -> first_child_text('avant') =~ s/(er|é)$//r
eq $modif -> first_child_text('apres') =~ s/(er|é)$//r ) {
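This compares the two after stripping those trailing characters. (Note: the /r regex modifier requires Perl 5.14 or later; on older versions it is a syntax error.)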
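If the `/r` modifier is not available (it was added in Perl 5.14), the same comparison can be done on a copy of each string. The sketch below is one way to do that; the helper names `stem` and `same_stem` are invented for this example, and it assumes the script file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8, so "é" below is one character

# Hypothetical helper: return the word without a trailing "er" or "é".
# It substitutes on a copy, so it does not need the /r modifier.
sub stem {
    my ($word) = @_;
    ( my $copy = $word ) =~ s/(?:er|é)$//;
    return $copy;
}

# Hypothetical helper: 1 when both words end in "er" or "é" and are
# identical once that ending is removed, 0 otherwise.
sub same_stem {
    my ( $avant, $apres ) = @_;
    return 0 unless $avant =~ /(?:er|é)$/ && $apres =~ /(?:er|é)$/;
    return stem($avant) eq stem($apres) ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
print same_stem( 'démissionné', 'démissionner' ), "\n";
print same_stem( 'démit',       'démis' ),        "\n";
```

Inside the XML::Twig loop, the condition would then read `same_stem( $modif->first_child_text('avant'), $modif->first_child_text('apres') )`.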