Counting text characters in an XML file with Perl's split function

Date: 2014-07-21 12:15:53

Tags: xml perl parsing split

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<ISOSpaceTaskv1.15>
<TEXT><![CDATA[
        WHERE TO GO
        In this sprawling city, the parts of Madrid of greatest
        interest to foreign visitors are remarkably compact. Viejo Madrid, the
        city of the Hapsburgs, covers a small area that extends east from the
        pitiful Río Manzanares and magnificent Palacio Real to Puerta del Sol.
        Almost all of it can be covered in a day or two, including a lengthy
        visit to the Royal Palace. The Madrid of the Bourbon dynasty, home to
        Spain’s great art museums, is the next area worthy of exploring (for
        art lovers, though, it may very well be the first). Spain’s Golden
        Triangle of Art is concentrated on the elegant but busy Paseo del
        Prado, between Puerta del Sol and Retiro Park. Those with more time in
        Madrid, either before or after side trips to the great towns of
        Castile, might explore the barrio of Salamanca, take in a bullfight, or
        visit one or more of the smaller, more personal museums, only a ride
        from the Puerta del Sol.
]]></TEXT>
<TAGS>
<PLACE id="pl0" start="66" end="72" text="Madrid" type="" dimensionality="AREA" form="NAM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
<PLACE id="pl1" start="47" end="51" text="city" type="" dimensionality="AREA" form="NOM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
</TAGS>
</ISOSpaceTaskv1.15>

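The start and end attributes of each <PLACE> tag give the character positions, within the text of the <TEXT> tag, where the string in the text attribute begins and ends. For example, Madrid starts at the 66th character and ends at the 72nd character of the text in <TEXT>.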

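I want to know the start and end values of every word in the text contained in <TEXT>. To do this, I use the following Perl code:

for my $tag ($doc->findnodes('ISOSpaceTaskv1.15/TEXT')){
    my $text = $tag->textContent;
    my @Text_sp = split(undef, $text);
    my $count = 1;
    foreach my $character (@Text_sp){
        print "$count\n";
        $count = $count + 1;
        ....
    }
}

The problem is that the start and end values I get differ from the ones in the XML file. For example, for the first PLACE tag I get the values 34 and 39. I suspect that split is not working as expected, but I don't know what exactly the problem is.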

2 answers:

Answer 0 (score: 1)

You can use a regular expression search together with the pos function:

while( $text=~/\b([^\s,.;:!?]+)/g ) {
   my $end = pos($text);          # offset just past the end of the match
   my $start = $end-length($1);   # offset of the first matched character
   print "$start-$end $1\n";
}

The first lines of output look like this:

9-14 WHERE
15-17 TO
18-20 GO
29-31 In
32-36 this
37-46 sprawling
47-51 city
53-56 the
57-62 parts
63-65 of
66-72 Madrid
73-75 of

The start and end values appear to be correct. You may need to adjust the regular expression to match the "word boundaries" you want.
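The output above suggests that the offsets in the PLACE tags are 0-based and that end points one past the last character. A minimal sketch to check that reading with substr follows. The $text string is a hand-built stand-in for the first two lines of the <TEXT> content (a leading newline, then eight spaces of indentation per line), and span_text is a hypothetical helper name, not part of the original code.

```perl
use strict;
use warnings;

# Stand-in for the start of the <TEXT> content (assumption: a leading
# newline followed by lines indented with eight spaces, as in the question).
my $indent = ' ' x 8;
my $text   = "\n${indent}WHERE TO GO\n"
           . "${indent}In this sprawling city, the parts of Madrid of greatest\n";

# start/end/text values copied from the two <PLACE> tags in the question.
my @places = (
    { id => 'pl0', start => 66, end => 72, text => 'Madrid' },
    { id => 'pl1', start => 47, end => 51, text => 'city'   },
);

# Hypothetical helper: returns the characters in [start, end),
# treating both offsets as 0-based.
sub span_text {
    my ($string, $start, $end) = @_;
    return substr($string, $start, $end - $start);
}

for my $place (@places) {
    my $found  = span_text($text, $place->{start}, $place->{end});
    my $status = $found eq $place->{text} ? 'ok' : 'MISMATCH';
    print "$place->{id}: [$place->{start},$place->{end}) '$found' $status\n";
}
```

If both tags report ok, counting characters from 1 instead of 0 would be one possible source of shifted values.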

Answer 1 (score: -1)

Use a regular expression to capture the words:

use strict;
use warnings;
use utf8;

use XML::LibXML;

binmode(STDOUT, ":unix:utf8");

my $xml = XML::LibXML->load_xml(string => do {
    binmode DATA, ':utf8';
    local $/;
    <DATA>
});

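for my $tag ($xml->findnodes('ISOSpaceTaskv1.15/TEXT')){
    my $text = $tag->textContent;
    # Match each run of characters that is not whitespace or one of , ( ) .
    while ($text =~ m/([^\s,().]+)/g) {
        my $word = $1;
        my $start = $-[0];      # 0-based offset where the match begins
        my $end = $+[0] - 1;    # offset of the last matched character (inclusive)
        print "$start-$end <$word>\n";
    }
}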

__DATA__
<?xml version="1.0" encoding="UTF-8" ?>
<ISOSpaceTaskv1.15>
<TEXT><![CDATA[
        WHERE TO GO
        In this sprawling city, the parts of Madrid of greatest
        interest to foreign visitors are remarkably compact. Viejo Madrid, the
        city of the Hapsburgs, covers a small area that extends east from the
        pitiful Río Manzanares and magnificent Palacio Real to Puerta del Sol.
        Almost all of it can be covered in a day or two, including a lengthy
        visit to the Royal Palace. The Madrid of the Bourbon dynasty, home to
        Spain’s great art museums, is the next area worthy of exploring (for
        art lovers, though, it may very well be the first). Spain’s Golden
        Triangle of Art is concentrated on the elegant but busy Paseo del
        Prado, between Puerta del Sol and Retiro Park. Those with more time in
        Madrid, either before or after side trips to the great towns of
        Castile, might explore the barrio of Salamanca, take in a bullfight, or
        visit one or more of the smaller, more personal museums, only a ride
        from the Puerta del Sol.
]]></TEXT>
<TAGS>
<PLACE id="pl0" start="66" end="72" text="Madrid" type="" dimensionality="AREA" form="NAM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
<PLACE id="pl1" start="47" end="51" text="city" type="" dimensionality="AREA" form="NOM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
</TAGS>
</ISOSpaceTaskv1.15>

Output:

9-13 <WHERE>
15-16 <TO>
18-19 <GO>
29-30 <In>
32-35 <this>
37-45 <sprawling>
47-50 <city>
53-55 <the>
57-61 <parts>
63-64 <of>
66-71 <Madrid>
73-74 <of>
76-83 <greatest>
93-100 <interest>
102-103 <to>
105-111 <foreign>
113-120 <visitors>
122-124 <are>
126-135 <remarkably>
137-143 <compact>
146-150 <Viejo>
152-157 <Madrid>
160-162 <the>
172-175 <city>
177-178 <of>
180-182 <the>
184-192 <Hapsburgs>
195-200 <covers>
202-202 <a>
204-208 <small>
210-213 <area>
215-218 <that>
220-226 <extends>
228-231 <east>
233-236 <from>
238-240 <the>
250-256 <pitiful>
258-260 <Río>
262-271 <Manzanares>
273-275 <and>
277-287 <magnificent>
289-295 <Palacio>
297-300 <Real>
302-303 <to>
305-310 <Puerta>
312-314 <del>
316-318 <Sol>
329-334 <Almost>
336-338 <all>
340-341 <of>
343-344 <it>
346-348 <can>
350-351 <be>
353-359 <covered>
361-362 <in>
364-364 <a>
366-368 <day>
370-371 <or>
373-375 <two>
378-386 <including>
388-388 <a>
390-396 <lengthy>
406-410 <visit>
412-413 <to>
415-417 <the>
419-423 <Royal>
425-430 <Palace>
433-435 <The>
437-442 <Madrid>
444-445 <of>
447-449 <the>
451-457 <Bourbon>
459-465 <dynasty>
468-471 <home>
473-474 <to>
484-490 <Spain’s>
492-496 <great>
498-500 <art>
502-508 <museums>
511-512 <is>
514-516 <the>
518-521 <next>
523-526 <area>
528-533 <worthy>
535-536 <of>
538-546 <exploring>
549-551 <for>
561-563 <art>
565-570 <lovers>
573-578 <though>
581-582 <it>
584-586 <may>
588-591 <very>
593-596 <well>
598-599 <be>
601-603 <the>
605-609 <first>
613-619 <Spain’s>
621-626 <Golden>
636-643 <Triangle>
645-646 <of>
648-650 <Art>
652-653 <is>
655-666 <concentrated>
668-669 <on>
671-673 <the>
675-681 <elegant>
683-685 <but>
687-690 <busy>
692-696 <Paseo>
698-700 <del>
710-714 <Prado>
717-723 <between>
725-730 <Puerta>
732-734 <del>
736-738 <Sol>
740-742 <and>
744-749 <Retiro>
751-754 <Park>
757-761 <Those>
763-766 <with>
768-771 <more>
773-776 <time>
778-779 <in>
789-794 <Madrid>
797-802 <either>
804-809 <before>
811-812 <or>
814-818 <after>
820-823 <side>
825-829 <trips>
831-832 <to>
834-836 <the>
838-842 <great>
844-848 <towns>
850-851 <of>
861-867 <Castile>
870-874 <might>
876-882 <explore>
884-886 <the>
888-893 <barrio>
895-896 <of>
898-906 <Salamanca>
909-912 <take>
914-915 <in>
917-917 <a>
919-927 <bullfight>
930-931 <or>
941-945 <visit>
947-949 <one>
951-952 <or>
954-957 <more>
959-960 <of>
962-964 <the>
966-972 <smaller>
975-978 <more>
980-987 <personal>
989-995 <museums>
998-1001 <only>
1003-1003 <a>
1005-1008 <ride>
1018-1021 <from>
1023-1025 <the>
1027-1032 <Puerta>
1034-1036 <del>
1038-1040 <Sol>
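Note that the end values above are inclusive (Madrid is reported as 66-71), whereas the PLACE tags in the XML use an end one past the last character (end="72"). A minimal sketch of the same loop without the "- 1", so that the numbers line up with the tags, is given below. The word_spans helper name and the shortened $text are assumptions for illustration; with XML::LibXML the string would come from textContent as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [word, start, end] triples where start is
# the 0-based offset of the first character and end is one past the last.
sub word_spans {
    my ($text) = @_;
    my @spans;
    while ($text =~ m/([^\s,().]+)/g) {
        # $-[0] is the offset where the match begins,
        # $+[0] is the offset just after the match ends.
        push @spans, [ $1, $-[0], $+[0] ];
    }
    return @spans;
}

# Stand-in for the start of the <TEXT> content (assumption: a leading
# newline followed by lines indented with eight spaces).
my $indent = ' ' x 8;
my $text   = "\n${indent}WHERE TO GO\n"
           . "${indent}In this sprawling city, the parts of Madrid of greatest\n";

for my $span (word_spans($text)) {
    my ($word, $start, $end) = @$span;
    print "$start-$end <$word>\n";
}
```

With this convention, the line for Madrid should read 66-72, matching the start and end attributes of the pl0 tag.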