How do I extract content between XML tags?

Time: 2013-12-25 05:30:11

Tags: xml regex perl xml-parsing perl-module

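I have XML data like the following. I want to extract the content of ce:affiliation and sa:affiliation into two variables, turn the sa:affiliation content into text formatted like the ce:affiliation text, and then compare the two texts.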

<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation><ce:affiliation id="aff2"><ce:label>b</ce:label><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn><ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation>


#!/usr/bin/perl
use strict;
use warnings;

my @files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $n = 1;    # affiliation counter (aff1, aff2, ...)
    while (my $line = <$fh>) {
        do {
            if ($line =~ /<ce:affiliation id=\"aff$n\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation><\/ce:affiliation>/) {
                my $count  = $3;    # raw content of sa:affiliation
                my $textfn = $2;    # text of ce:textfn
                print "$count\n";
                print "$textfn\n";
                if ($count =~ /<\/sa:(.+?)>/) {
                    # Turn each closing tag into ", " and drop the opening tags
                    $count =~ s/<\/sa:organization>/, /g;
                    $count =~ s/<\/sa:city>/, /g;
                    $count =~ s/<\/sa:country>/, /g;
                    $count =~ s/<\/sa:state>/, /g;
                    $count =~ s/<sa:organization>//g;
                    $count =~ s/<sa:city>//g;
                    $count =~ s/<sa:country>//g;
                    $count =~ s/<sa:state>//g;
                    # Remove the trailing ", "
                    chop($count);
                    chop($count);
                    if ($count ne $textfn) {
                        print $out "$file affiliation $n is mismatch\n";
                    }
                }
            }
            else {
                if ($line =~ /<ce:affiliation id=\"aff$n\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><\/ce:affiliation>/) {
                    print $out "$file sa:affiliation missing for $n\n";
                }
            }
            $n = $n + 1;
        } while ($line =~ /aff$n/);
    }
}

This code fails if some ce:affiliation elements do not contain <ce:label> and <sa:affiliation>:

<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation><ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn></ce:affiliation><ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation><ce:correspondence id="cor1"></article>

1 answer:

Answer 0 (score: 3)

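Please don't use regular expressions to parse XML. It will work in simple cases, and tchrist has shown that you can make it work in the general case (although at that point you are really writing your own XML parser around regular expressions), but it is much easier to just use a library written for that purpose.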
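To illustrate how a pattern that handles simple input can go wrong, here is a minimal sketch that applies the pattern from the question to a shortened input. The input string is a made-up abbreviation of the failing sample (the texts "Norris Center" and "Stanford University" are placeholders), and the counter is fixed at 2 for brevity.

```perl
use strict;
use warnings;

# Hypothetical, shortened input: aff2 has neither <ce:label> nor
# <sa:affiliation>; aff3 has both.
my $line =
    '<ce:affiliation id="aff2"><ce:textfn>Norris Center</ce:textfn></ce:affiliation>'
  . '<ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Stanford University</ce:textfn>'
  . '<sa:affiliation><sa:organization>Stanford University</sa:organization></sa:affiliation>'
  . '</ce:affiliation>';

# The pattern from the question, looking for aff2. The first (.+?) must
# consume at least one character, so it runs past aff2's own <ce:textfn>
# and into aff3 before the rest of the pattern can match.
my $textfn;
if ($line =~ m{<ce:affiliation id="aff2">(.+?)<ce:textfn>(.+?)</ce:textfn><sa:affiliation>(.+?)</sa:affiliation></ce:affiliation>}) {
    $textfn = $2;
}

print defined $textfn
    ? "aff2 matched with text: $textfn\n"
    : "aff2 did not match\n";
```

In this sketch the match for aff2 succeeds, but the captured text belongs to aff3, because the pattern has no notion of where one element ends and the next begins.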

Example:

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_file('output.xml');
my @badnodes;
foreach my $affil ($doc->findnodes("//*[name()='ce:affiliation']")) {
   # Record the element and move on to the next one if a required child is missing
   push(@badnodes, $affil), next unless $affil->findnodes("*[name()='ce:label']");
   push(@badnodes, $affil), next unless $affil->findnodes("*[name()='sa:affiliation']");
}
print "Found ", scalar(@badnodes), " bad affiliation elements, with these IDs:\n";
print "\t", join("\n\t", map { $_->getAttribute('id') } @badnodes), "\n";

If I wrap a document element around the first example and add the missing closing tag on the second affiliation element, I get this output:

Found 1 bad affiliation elements, with these IDs:
    aff2
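The same approach could be extended to the comparison the question asks for. The following is only a sketch under some assumptions: the `<article>` wrapper and its namespace URIs (`urn:example:ce`, `urn:example:sa`) are placeholders added so the fragment parses cleanly, `check_affiliations` is a hypothetical helper name, and the children of sa:affiliation are assumed to appear in the same order as in the ce:textfn text.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample data from the question inside a hypothetical wrapper element.
my $xml = <<'XML';
<article xmlns:ce="urn:example:ce" xmlns:sa="urn:example:sa">
<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation>
<ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn></ce:affiliation>
<ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation>
</article>
XML

# Returns one report line per affiliation that is missing its
# sa:affiliation or whose structured parts do not match ce:textfn.
sub check_affiliations {
    my ($doc) = @_;
    my @report;
    for my $affil ($doc->findnodes("//*[name()='ce:affiliation']")) {
        my $id = $affil->getAttribute('id');
        my ($textfn) = $affil->findnodes("*[name()='ce:textfn']");
        my ($sa)     = $affil->findnodes("*[name()='sa:affiliation']");
        if (!defined $sa) {
            push @report, "$id: sa:affiliation missing";
            next;
        }
        # Join the text of every child element of sa:affiliation with ", "
        my $joined = join ', ', map { $_->textContent } $sa->findnodes('*');
        my $text   = defined $textfn ? $textfn->textContent : '';
        push @report, "$id: mismatch" if $joined ne $text;
    }
    return @report;
}

my $doc = XML::LibXML->load_xml(string => $xml);
print "$_\n" for check_affiliations($doc);
```

Because the parser handles element boundaries, a missing ce:label or sa:affiliation in one element does not affect how the following elements are read.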