How to compare the text between XML tags using Perl

Asked: 2013-12-04 05:13:09

Tags: xml perl xml-parsing perl-module

I have XML data like this:

 <ce:affiliation id="aff1">
 <ce:label>a</ce:label>
 <ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn>
  <sa:affiliation>
 <sa:organization>Department of Urology</sa:organization>
 <sa:organization>Radboud University Nijmegen Medical Center</sa:organization>
 <sa:city>Nijmegen</sa:city>
 </sa:affiliation>

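and so on...

I want to read the text of the elements inside sa:affiliation, join it into a comma-separated string such as "Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen", and then compare that string with the text between <ce:textfn> and </ce:textfn>.

That is, I need to compare each ce:affiliation element with its sa:affiliation across multiple files, and tell the user if there is any mismatch.

4 Answers:

Answer 0 (score: 2)

Your question is a bit vague. It is not clear where each fragment of XML lives. One file? Several files? One fragment per file? Several? If the data is spread across several files, how do you associate a ce:affiliation element with the corresponding sa:affiliation, especially when what you are checking is whether the 2 texts match? Why is there no country in the sa:affiliation? Where are the namespaces declared?

Assuming the 2 pieces of data are in 2 files, and that the namespace prefixes do not change: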

#!/usr/bin/perl

use strict;

use warnings;

use XML::Twig;
use Test::More;

my $DEFAULT_COUNTRY= "The Netherlands";

# usage is <tool> <ce file> <sa file>
my( $ce_file, $sa_file)= @ARGV;

my $ce= XML::Twig->new->parsefile( $ce_file)->root;
my $ce_text = $ce->field( 'ce:textfn');

my $sa= XML::Twig->new->parsefile( $sa_file)->root;

# add the country if not present
if( ! $sa->first_child( 'sa:country')) 
  { $sa->insert_new_elt( last_child => 'sa:country' => $DEFAULT_COUNTRY); }

my $sa_text= join( ', ', $sa->children_text);

is( $ce_text, $sa_text, "checking " . $ce->id);

done_testing();

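Answer 1 (score: 1)

You can use XML::XPath to find the nodes you need. Then just check whether the string_value of the two nodes is ne (not equal).

Answer 2 (score: 0)

I finally came up with this code, but is there any way to get the ce:affiliation and sa:affiliation text without using if/else conditions? It fails in some cases.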

#!/usr/bin/perl
@files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach $file (@files) {
    open(FILE, "$file");
    $a = 1;    # index of the affiliation (aff1, aff2, ...) being checked
    while (my $line = <FILE>) {
        do {
            if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation><\/ce:affiliation>/) {
                $count  = $3;    # markup inside <sa:affiliation>
                $textfn = $2;    # text of <ce:textfn>
                print("$count\n");
                print("$textfn\n");
                if ($count =~ /<\/sa:(.+?)>/) {
                    # turn closing tags into ", " separators and drop opening tags
                    $count =~ s/<\/sa:organization>/, /g;
                    $count =~ s/<\/sa:city>/, /g;
                    $count =~ s/<\/sa:country>/, /g;
                    $count =~ s/<\/sa:state>/, /g;
                    $count =~ s/<sa:organization>//g;
                    $count =~ s/<sa:city>//g;
                    $count =~ s/<sa:country>//g;
                    $count =~ s/<sa:state>//g;
                    # remove the trailing ", "
                    chop($count);
                    chop($count);
                    if ($count ne $textfn) {
                        print $out "$file affilliation $a is mismatch\n";
                    }
                }
            }
            else {
                if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><\/ce:affiliation>/) {
                    print $out "$file sa:affilliation missing for $a\n";
                }
            }
            $a = $a + 1;
        } while ($line =~ /aff$a/);
    }
}

For XML like this I get the wrong result:

 <ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation><ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn></ce:affiliation><ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation><ce:correspondence id="cor1"></article>
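One reason the regex version can misfire on input like this, as far as I can tell: (.+?) has to consume at least one character, so for aff2, which has no ce:label before ce:textfn, the match runs on into aff3 and reports aff3's data under aff2. A parser-based approach avoids that. Below is a minimal sketch using XML::Twig (the module from answer 0), assuming every sa:affiliation is nested inside its ce:affiliation and that the namespace prefixes are fixed. The check_affiliations helper, the 'sample' label and the <article> wrapper are made up for illustration, and the aff3 text is deliberately altered from the data above to show a mismatch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: the <article> wrapper only makes the fragment
# well-formed, and aff3's ce:textfn is altered to force a mismatch.
my $xml = <<'XML';
<article>
<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation>
<ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn></ce:affiliation>
<ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, USA</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation>
</article>
XML

# Returns one message per ce:affiliation whose sa:affiliation is missing
# or whose comma-joined text differs from ce:textfn.
sub check_affiliations {
    my ($xml_string, $label) = @_;
    my @problems;

    my $twig = XML::Twig->new;
    $twig->parse($xml_string);

    for my $aff ($twig->root->descendants('ce:affiliation')) {
        my $id     = $aff->att('id') // '(no id)';
        my $textfn = $aff->first_child('ce:textfn');
        my $sa     = $aff->first_child('sa:affiliation');

        if (!$sa) {
            push @problems, "$label: sa:affiliation missing for $id";
            next;
        }

        my $expected = defined $textfn ? $textfn->text : '';
        # '#ELT' keeps only real child elements, skipping any stray text nodes
        my $joined = join ', ', map { $_->text } $sa->children('#ELT');

        push @problems, "$label: ce:textfn and sa:affiliation mismatch in $id"
            if $joined ne $expected;
    }
    return @problems;
}

print "$_\n" for check_affiliations($xml, 'sample');
```

To run this over many files, the same loop could be fed from XML::Twig->new->parsefile($file) for each file in <*.xml>, with $file passed as the label. Because the parser handles element boundaries, no per-tag substitutions or chop calls are needed, and new child elements under sa:affiliation are picked up automatically.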

Answer 3 (score: 0)

I finally got the required output.

#!/usr/bin/perl
@files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach $file (@files) {
    open(FILE, "$file");
    my $a = 1;    # index of the affiliation (aff1, aff2, ...) being checked
    while (my $line = <FILE>) {
        do {
            if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<\/ce:affiliation>/) {
                $count = $1;    # everything inside this ce:affiliation
                # drop the optional label so it cannot interfere with the match below
                if ($count =~ /<ce:label>/) {
                    $count =~ s/<ce:label>(.+?)<\/ce:label>//;
                }
                if ($count =~ /<sa:affiliation>/) {
                    if ($count =~ /<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation>/) {
                        $textfn = $1;
                        $sff    = $2;
                        # turn closing tags into ", " separators and drop opening tags
                        $sff =~ s/<\/sa:organization>/, /g;
                        $sff =~ s/<\/sa:city>/, /g;
                        $sff =~ s/<\/sa:country>/, /g;
                        $sff =~ s/<\/sa:state>/, /g;
                        $sff =~ s/<sa:organization>//g;
                        $sff =~ s/<sa:city>//g;
                        $sff =~ s/<sa:country>//g;
                        $sff =~ s/<sa:state>//g;
                        # remove the trailing ", "
                        chop($sff);
                        chop($sff);
                    }
                    if ($textfn ne $sff) {
                        print $out "$file ce:aff and sa:aff  mismatch in aff$a\n";
                    }
                    if ($textfn =~ /<ce:sup>/) {
                        print $out "$file check label aff$a\n";
                    }
                }
                else {
                    if ($line =~ /\"art520.dtd\"/) {
                        print $out "$file strct affilition missing for aff$a\n";
                    }
                }
            }
            $a = $a + 1;
        } while ($line =~ /aff$a/);
    }
}