I have XML files as input. These XML files contain markup such as the following.
First case:
<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>
Second case:
<xref ref-type="bibr" rid="perl-ch001-bib009"><sup>9</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib012"><sup>12</sup></xref><sup>,</sup><xref ref-type="bibr" rid="perl-ch001-bib057"><sup>57</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib059"><sup>59</sup></xref>
In the first case the numbers 80 and 82 are present but 81 is missing; in the second case the ranges are 9-12 and 57-59. &#x2013; is the entity for the en dash (the "hyphen" between the numbers). I need to copy the entire content of each XML file and insert the missing references of each range at that position.
The output should be as follows. For the first case (pattern: 80 81-82):
<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref><xref ref-type="bibr" rid="perl-ch006-bib081"><sup>81</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>
For the second case (pattern: 9 10 11-12, 57 58-59):
<xref ref-type="bibr" rid="perl-ch001-bib009"><sup>9</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib010"><sup>10</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib011"><sup>11</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib012"><sup>12</sup></xref><sup>,</sup><xref ref-type="bibr" rid="perl-ch001-bib057"><sup>57</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib058"><sup>58</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib059"><sup>59</sup></xref>
All changes should be made in the output file, so that the input file is left untouched.
Code:
#!/usr/bin/perl
use strict;
use Cwd;
use File::Basename;
use File::Copy;
my $path1 = getcwd;
# Copy every XML file from Input/ into the current directory.
opendir(INP, "$path1/Input") or die "Cannot open Input directory: $!";
my @out = grep(/\.xml$/, readdir(INP));
closedir INP;
foreach my $final (@out)
{
    my $filetobecopied = "Input/" . $final;
    my $newfile = $final;
    copy($filetobecopied, $newfile) or die "File cannot be copied.";
}
# Collect the copied XML files and open the report file.
opendir DIR, $path1 or die "cant open dir";
my @files = grep /\.xml$/, (readdir DIR);
closedir DIR;
open(F6, ">Ref.txt");
print F6 "FileName\tMatchedString\tOutput\n";
foreach my $f (@files)
{
    open(F1, "<$f") or die "Cannot open file: $f";
    my $data = join("", <F1>);
    close F1;
    my $xml_list = $data;
    #print F6 $xml_list."\n";
    # Temporarily turn the en dash entity into a plain hyphen.
    $xml_list =~ s/&#x2013;/-/gs;
    $xml_list =~ s/&#x02013;/-/gs;
    while ($xml_list =~ m/(<xref ref-type="(bibr|bib)" rid="(.*?)-ch(\d+)-(bibr|bib)(\d+)">(<sup>)?(\d+)(<\/sup>)?<\/xref><sup>(-)+<\/sup>)(<xref ref-type="(bibr|bib)" rid="(.*?)-ch(\d+)-bib(\d+)">(<sup>)?(\d+)(<\/sup>)?<\/xref>)/igs)
    {
        my $i;
        my $xref = $1; my $bibr = $2;
        my $rid = $3; my $ch = $4; my $bib = $6; my $hyp = $10;
        my $num = $8;
        my $xref1 = $11;
        my $num1 = $17;
        if ($hyp =~ m/(-)/gs)
        {
            my $counter = $num;
            while ($counter <= $num1) #for($counter=$num;$counter<=$num1;$counter++)
            {
                #print F6 "<xref ref-type=\"$bibr\" rid=\"$rid\-ch$ch\-$bibr$counter\"><sup>$counter<\/sup><\/xref>,"."\n";
                $counter++;
            }
        }
    }
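    $xml_list =~ s/&orb;/\(/g;
    $xml_list =~ s/&crb;/\)/g;
    # Turn the hyphens back into the en dash entity.
    $xml_list =~ s/-/&#x2013;/gs;
    $xml_list =~ s/-/&#x02013;/gs;
    open(OUT, ">$path1/Output/$f");
    print OUT $xml_list;
    close OUT;
}
# Remove the temporary copies from the current directory.
foreach my $del (@files)
{
    unlink $del;
}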
Any help would be greatly appreciated.
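Answer 0 (score: 0)
Your program is already most of the way there. The main thing missing is inserting the missing xref elements at the right position. Inserting into $xml_list can be done with substr; the insertion offset can be taken from the @+ array (called @LAST_MATCH_END under use English). The core of the code then becomes: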
#$xml_list=~s/&#x2013;/-/gs;  don't do this (gives trouble when changing back)
#$xml_list=~s/&#x02013;/-/gs; don't do this (gives trouble when changing back)
# [^"]*? rather than .*? keeps the rid prefix from running past the closing quote
while ($xml_list=~/(<xref\ ref-type="(bibr?)"\ rid="([^"]*?)-ch(\d+)-(bibr?)(\d+)">
                      (<sup>)?(\d+)(<\/sup>)?
                    <\/xref>)<sup>(&\#x0?2013;)+<\/sup>
                   (<xref\ +ref-type="(bibr?)"\ rid="([^"]*?)-ch(\d+)-bib(\d+)">
                      (<sup>)?(\d+)(<\/sup>)?
                    <\/xref>)
                  /igsx)
{
    my $insert=$+[1];   # end of first (<xref.../xref>) submatch; here we insert
    my ($bibr,$rid,$ch,$bib)=($2,$3,$4,$5.$6);
    my $num=$8;
    my $num1=$17;
    my $endpos = pos $xml_list;
    for (my $counter=$num; ++$counter<$num1; )
    {
        ++$bib;         # string increment: "bib080" becomes "bib081"
        my $insertion = "<xref ref-type=\"$bibr\" rid=\"$rid-ch$ch-$bib\">"
                       ."<sup>$counter</sup>"
                       ."</xref>\n";   # insert this into $xml_list at $insert
                                       # (remove the "\n" to keep the xrefs on one line)
        substr $xml_list, $insert, 0, $insertion;
        $insert += length $insertion;  # push start of next insert to the right
        $endpos += length $insertion;  # push start of next search to the right
    }
    pos $xml_list = $endpos;           # set start position of next search
}
#$xml_list=~s/-/&#x2013;/gs; trouble: would also change normal hyphens
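To try the idea outside the file loop, the same substr and @+ technique can be wrapped in a small function. This is only a sketch under a few assumptions: the function name expand_xref_ranges is made up for illustration, the en dash is assumed to appear as the entity &#x2013; (or &#x02013;), the rid prefix is matched with [^"]*? so that a match cannot run across several xref elements, and no newline is added after each inserted element so the result stays on one line as in the expected output of the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration only): returns a copy of
# the XML string with the missing <xref> elements of every range filled in.
sub expand_xref_ranges {
    my ($xml) = @_;
    while ($xml =~ m{
            (<xref\ ref-type="(bibr?)"\ rid="([^"]*?)-ch(\d+)-(bibr?)(\d+)">
               (?:<sup>)?(\d+)(?:</sup>)?
             </xref>)
            <sup>(?:&\#x0?2013;)+</sup>
            <xref\ ref-type="bibr?"\ rid="[^"]*">
               (?:<sup>)?(\d+)(?:</sup>)?
            </xref>
        }gisx)
    {
        my ($type, $prefix, $ch, $bib) = ($2, $3, $4, $5 . $6);
        my ($first, $last) = ($7, $8);
        my $insert = $+[1];      # offset just past the first xref of the range
        my $endpos = pos $xml;   # offset just past the whole match
        for my $n ($first + 1 .. $last - 1) {
            ++$bib;              # string increment, e.g. "bib080" to "bib081"
            my $new = qq{<xref ref-type="$type" rid="$prefix-ch$ch-$bib">}
                    . qq{<sup>$n</sup></xref>};
            substr $xml, $insert, 0, $new;   # insert without replacing anything
            $insert += length $new;
            $endpos += length $new;
        }
        pos($xml) = $endpos;     # modifying the string resets pos, so restore it
    }
    return $xml;
}

# Usage with the first case from the question.
my $sample = '<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref>'
           . '<sup>&#x2013;</sup>'
           . '<xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>';
print expand_xref_ranges($sample), "\n";
```

Because the function works on a copy of the string and returns the result, the main loop can read each file from Input, call it once, and write the returned string to Output without touching the input file. Note that the string increment on the rid suffix relies on the suffix being letters followed by digits, which holds for values like bib080.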