How to add missing range members in XML files using Perl?

Time: 2014-12-05 08:53:44

Tags: regex perl

I have XML files as input. These files contain markup such as the following.

First example:

<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref><sup>&#x2013;</sup><xref   ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>

Second example:

<xref ref-type="bibr" rid="perl-ch001-bib009"><sup>9</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib012"><sup>12</sup></xref><sup>,</sup><xref ref-type="bibr" rid="perl-ch001-bib057"><sup>57</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib059"><sup>59</sup></xref>

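In the two examples above, the numbers 80 and 82 appear with 81 missing, as do the ranges 9-12 and 57-59; &#x2013; is the entity for the dash (an en dash, used here as a hyphen). I need to copy the entire contents of each XML file and add the missing numbers of each range at that position.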

The output should be as follows. For the first example (i.e. in the pattern 80 81-82):

<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref><xref ref-type="bibr" rid="perl-ch006-bib081"><sup>81</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>

For the second example (i.e. in the pattern 9 10 11-12, 57 58-59):

<xref ref-type="bibr" rid="perl-ch001-bib009"><sup>9</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib010"><sup>10</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib011"><sup>11</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib012"><sup>12</sup></xref><sup>,</sup><xref ref-type="bibr" rid="perl-ch001-bib057"><sup>57</sup></xref><xref ref-type="bibr" rid="perl-ch001-bib058"><sup>58</sup></xref><sup>&#x2013;</sup><xref ref-type="bibr" rid="perl-ch001-bib059"><sup>59</sup></xref>

All changes must be made in the output file, so that the input file is left untouched.

Code:

#!/usr/bin/perl
use strict;
use Cwd;
use File::Basename;
use File::Copy;

my $path1=getcwd;
opendir(INP, "$path1\/Input");
my @out = grep(/\.(xml)$/,readdir(INP));
closedir INP;

foreach my $final(@out)
{
my $filetobecopied = "Input\/".$final;
my $newfile = $final;
copy($filetobecopied, $newfile) or die "File cannot be copied.";
}

opendir DIR, $path1 or die "cant open dir";
my @files = grep /(.*?)\.(xml)$/,(readdir DIR);
closedir DIR;

open(F6, ">Ref.txt");
print F6 "FileName\tMatchedString\tOutput\n";

foreach my $f(@files)
{
open(F1, "<$f") or die "Cannot open file: $f";
my $data=join("", <F1>);
close F1;
my $xml_list=$data;
#print F6 $xml_list."\n";
$xml_list=~s/&#x2013;/-/gs;
$xml_list=~s/&#x02013;/-/gs;

while($xml_list=~m/(<xref ref-type="(bibr|bib)" rid="(.*?)-ch(\d+)-(bibr|bib)(\d+)">(<sup>)?(\d+)(<\/sup>)?<\/xref><sup>(-)+<\/sup>)(<xref ref-type="(bibr|bib)" rid="(.*?)-ch(\d+)-bib(\d+)">(<sup>)?(\d+)(<\/sup>)?<\/xref>)/igs)
{
my $i;
my $xref=$1;my $bibr=$2;
my $rid=$3; my $ch=$4;my $bib=$6;my $hyp=$10;
my $num=$8;
my $xref1=$11;
my $num1=$17;

if($hyp=~m/(-)/gs)
{
my $counter=$num;
while($counter<=$num1)   #for($counter=$num;$counter<=$num1;$counter++)
{
#print F6 "<xref ref-type=\"$bibr\" rid=\"$rid\-ch$ch\-$bibr$counter\"><sup>$counter<\/sup><\/xref>,"."\n";
$counter++;
}
}
}

$xml_list=~s/&orb;/\(/g;
$xml_list=~s/&crb;/\)/g;
$xml_list=~s/-/&#x2013;/gs;
$xml_list=~s/-/&#x02013;/gs;

open(OUT, ">$path1\/Output\/$f");
print OUT $xml_list;
close OUT
}
foreach my $del(@files)
{
unlink $del
}

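Any help would be appreciated.

1 Answer:

Answer 0: (score: 0)

Your program is already most of the way there. The main thing missing is inserting the missing xref elements at the right position. Inserting into $xml_list can be done with substr; the offset for the insertion can be taken from the @LAST_MATCH_END array (@+). The core of the code then becomes: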

#$xml_list=~s/&#x2013;/-/gs;    don't do this (gives trouble when changing back)
#$xml_list=~s/&#x02013;/-/gs;   don't do this (gives trouble when changing back)

while ($xml_list=~/(<xref\ ref-type="(bibr?)"\ rid="(.*?)-ch(\d+)-(bibr?)(\d+)">
                       (<sup>)?(\d+)(<\/sup>)?
                    <\/xref>)<sup>(&\#x0?2013;)+<\/sup>
                   (<xref\ +ref-type="(bibr?)"\ rid="(.*?)-ch(\d+)-bib(\d+)">
                       (<sup>)?(\d+)(<\/sup>)?
                    <\/xref>)
                  /igsx)
{
    my $insert=$+[1];   # end of first (<xref.../xref>) submatch; here we insert
    my ($bibr,$rid,$ch,$bib)=($2,$3,$4,$5.$6);
    my $num=$8;
    my $num1=$17;
    my $endpos = pos $xml_list;
    for (my $counter=$num; ++$counter<$num1; )
    {
        ++$bib;
        my $insertion = "<xref ref-type=\"$bibr\" rid=\"$rid-ch$ch-$bib\">"
                           ."<sup>$counter</sup>"
                       ."</xref>";      # insert this into $xml_list at $insert 
        substr $xml_list, $insert, 0, $insertion;
        $insert += length $insertion;   # push start of next insert to the right
        $endpos += length $insertion;   # push start of next search to the right
    }
    pos $xml_list = $endpos;    # set start position of next search
}

#$xml_list=~s/-/&#x2013;/gs;    trouble: would also change normal hyphens
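
To try the approach in isolation, here is a minimal self-contained sketch that applies the same substr / @+ technique to the sample strings from the question. The helper name fill_xref_ranges is hypothetical, the rid match is narrowed to [^"]* as an assumption about the data, and the new ids rely on Perl's string auto-increment, so it assumes the bibNNN suffix keeps a fixed digit width.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fill_xref_ranges is a hypothetical helper name (not from the original post).
# It returns a copy of the XML text with the missing <xref> elements inserted
# between the first number of a range and the dash.
sub fill_xref_ranges {
    my ($xml) = @_;
    while ($xml =~ /(<xref\ +ref-type="(bibr?)"\ rid="([^"]*?)-ch(\d+)-(bibr?)(\d+)">
                       (?:<sup>)?(\d+)(?:<\/sup>)?
                    <\/xref>)<sup>(?:&\#x0?2013;)+<\/sup>
                    <xref\ +ref-type="bibr?"\ rid="[^"]*">
                       (?:<sup>)?(\d+)(?:<\/sup>)?
                    <\/xref>
                   /gix) {
        my $insert = $+[1];        # offset just past the first <xref>...</xref>
        my $endpos = pos $xml;     # offset just past the whole match
        my ($type, $rid, $ch, $bib) = ($2, $3, $4, $5 . $6);
        my ($first, $last) = ($7, $8);
        for my $n ($first + 1 .. $last - 1) {
            ++$bib;                # string increment: "bib080" becomes "bib081"
            my $new = qq{<xref ref-type="$type" rid="$rid-ch$ch-$bib"><sup>$n</sup></xref>};
            substr $xml, $insert, 0, $new;   # 4-argument substr inserts in place
            $insert += length $new;          # next insertion goes after this one
            $endpos += length $new;          # the end of the match moved right too
        }
        pos($xml) = $endpos;       # resume the search after the original match
    }
    return $xml;
}

# First sample from the question: 80, dash, 82.
my $in = '<xref ref-type="bibr" rid="perl-ch006-bib080"><sup>80</sup></xref>'
       . '<sup>&#x2013;</sup>'
       . '<xref ref-type="bibr" rid="perl-ch006-bib082"><sup>82</sup></xref>';
print fill_xref_ranges($in), "\n";
```

Because the dash entity is matched directly and never converted to a plain hyphen, ordinary hyphens such as the one in ref-type are left alone.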