I have an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<ISOSpaceTaskv1.15>
<TEXT><![CDATA[
WHERE TO GO
In this sprawling city, the parts of Madrid of greatest
interest to foreign visitors are remarkably compact. Viejo Madrid, the
city of the Hapsburgs, covers a small area that extends east from the
pitiful Río Manzanares and magnificent Palacio Real to Puerta del Sol.
Almost all of it can be covered in a day or two, including a lengthy
visit to the Royal Palace. The Madrid of the Bourbon dynasty, home to
Spain’s great art museums, is the next area worthy of exploring (for
art lovers, though, it may very well be the first). Spain’s Golden
Triangle of Art is concentrated on the elegant but busy Paseo del
Prado, between Puerta del Sol and Retiro Park. Those with more time in
Madrid, either before or after side trips to the great towns of
Castile, might explore the barrio of Salamanca, take in a bullfight, or
visit one or more of the smaller, more personal museums, only a ride
from the Puerta del Sol.
]]></TEXT>
<TAGS>
<PLACE id="pl0" start="66" end="72" text="Madrid" type="" dimensionality="AREA" form="NAM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
<PLACE id="pl1" start="47" end="51" text="city" type="" dimensionality="AREA" form="NOM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
</TAGS>
</ISOSpaceTaskv1.15>
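The start and end attributes of each <PLACE> tag give the character offsets, within the text of the <TEXT> element, at which the string in its text attribute begins and ends. For example, Madrid begins at character 66 and ends at character 72 of the text in <TEXT>.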
I want to know the start and end values of every word in the text contained in <TEXT>. To do that, I am using the following Perl code:
for my $tag ($doc->findnodes('ISOSpaceTaskv1.15/TEXT')){
my $text = $tag->textContent;
my @Text_sp = split(undef, $text);
my $count = 1;
foreach my $character (@Text_sp){
print "$count\n";
$count = $count + 1;
....
}
}
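The problem is that the start and end values I get differ from the ones in the XML file. For example, for the first PLACE tag I get the values 34 and 39. I suspect that split is not working as expected, but I do not know exactly what the problem is.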
Answer 0 (score: 1)
You can use a regular expression search together with the pos function:
while( $text=~/\b([^\s,.;:!?]+)/g ) {
my $end = pos($text);
my $start = $end-length($1);
print "$start-$end $1\n";
}
The first lines of the output look like this:
9-14 WHERE
15-17 TO
18-20 GO
29-31 In
32-36 this
37-46 sprawling
47-51 city
53-56 the
57-62 parts
63-65 of
66-72 Madrid
73-75 of
The start and end values appear to be correct; you may need to adjust the regular expression to match the word boundaries you want.
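As one possible way of adjusting the word boundaries, here is a minimal sketch that treats a word as a run of word characters optionally joined by an apostrophe, so that a token such as Spain’s stays in one piece. The helper name word_spans and the sample string are assumptions made for illustration; the offsets come from the built-in @- and @+ arrays, and the end offset is exclusive, as with pos.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper (not from the original answer): returns a list of
# [start, end, word] triples. start is the 0-based offset of the first
# character, end is the offset just past the last one (same as pos()).
sub word_spans {
    my ($text) = @_;
    my @spans;
    # Assumed definition of a "word": word characters, optionally joined by
    # an ASCII apostrophe or U+2019 (right single quotation mark).
    while ($text =~ /(\w+(?:['\x{2019}]\w+)*)/g) {
        push @spans, [ $-[1], $+[1], $1 ];
    }
    return @spans;
}

# Small made-up sample; \x{2019} is the curly apostrophe, \x{ed} is i-acute.
my $sample = "In this city, Spain\x{2019}s R\x{ed}o Manzanares.";
for my $span (word_spans($sample)) {
    my ($start, $end, $word) = @$span;
    # Sanity check: the offsets must select exactly the matched word.
    die "offset mismatch for $word"
        unless substr($sample, $start, $end - $start) eq $word;
    print "$start-$end $word\n";
}
```

Because the end offset is exclusive, end minus start is the word length, which makes the substr check above a convenient way to validate any change to the pattern.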
Answer 1 (score: -1)
Use a regular expression to capture the words:
use strict;
use warnings;
use utf8;
use XML::LibXML;
binmode(STDOUT, ":unix:utf8");
my $xml = XML::LibXML->load_xml(string => do {
binmode DATA, ':utf8';
local $/;
<DATA>
});
for my $tag ($xml->findnodes('ISOSpaceTaskv1.15/TEXT')){
my $text = $tag->textContent;
while ($text =~ m/([^\s,().]+)/g) {
my $word = $1;
my $start = $-[0];
my $end = $+[0] - 1;
print "$start-$end <$word>\n";
}
}
__DATA__
<?xml version="1.0" encoding="UTF-8" ?>
<ISOSpaceTaskv1.15>
<TEXT><![CDATA[
WHERE TO GO
In this sprawling city, the parts of Madrid of greatest
interest to foreign visitors are remarkably compact. Viejo Madrid, the
city of the Hapsburgs, covers a small area that extends east from the
pitiful Río Manzanares and magnificent Palacio Real to Puerta del Sol.
Almost all of it can be covered in a day or two, including a lengthy
visit to the Royal Palace. The Madrid of the Bourbon dynasty, home to
Spain’s great art museums, is the next area worthy of exploring (for
art lovers, though, it may very well be the first). Spain’s Golden
Triangle of Art is concentrated on the elegant but busy Paseo del
Prado, between Puerta del Sol and Retiro Park. Those with more time in
Madrid, either before or after side trips to the great towns of
Castile, might explore the barrio of Salamanca, take in a bullfight, or
visit one or more of the smaller, more personal museums, only a ride
from the Puerta del Sol.
]]></TEXT>
<TAGS>
<PLACE id="pl0" start="66" end="72" text="Madrid" type="" dimensionality="AREA" form="NAM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
<PLACE id="pl1" start="47" end="51" text="city" type="" dimensionality="AREA" form="NOM" domain="" continent="" state="" country="" ctv="" gazref="" latLong="" elevation="" mod="" dcl="FALSE" countable="TRUE" gquant="" scopes="" comment="" />
</TAGS>
</ISOSpaceTaskv1.15>
Output:
9-13 <WHERE>
15-16 <TO>
18-19 <GO>
29-30 <In>
32-35 <this>
37-45 <sprawling>
47-50 <city>
53-55 <the>
57-61 <parts>
63-64 <of>
66-71 <Madrid>
73-74 <of>
76-83 <greatest>
93-100 <interest>
102-103 <to>
105-111 <foreign>
113-120 <visitors>
122-124 <are>
126-135 <remarkably>
137-143 <compact>
146-150 <Viejo>
152-157 <Madrid>
160-162 <the>
172-175 <city>
177-178 <of>
180-182 <the>
184-192 <Hapsburgs>
195-200 <covers>
202-202 <a>
204-208 <small>
210-213 <area>
215-218 <that>
220-226 <extends>
228-231 <east>
233-236 <from>
238-240 <the>
250-256 <pitiful>
258-260 <Río>
262-271 <Manzanares>
273-275 <and>
277-287 <magnificent>
289-295 <Palacio>
297-300 <Real>
302-303 <to>
305-310 <Puerta>
312-314 <del>
316-318 <Sol>
329-334 <Almost>
336-338 <all>
340-341 <of>
343-344 <it>
346-348 <can>
350-351 <be>
353-359 <covered>
361-362 <in>
364-364 <a>
366-368 <day>
370-371 <or>
373-375 <two>
378-386 <including>
388-388 <a>
390-396 <lengthy>
406-410 <visit>
412-413 <to>
415-417 <the>
419-423 <Royal>
425-430 <Palace>
433-435 <The>
437-442 <Madrid>
444-445 <of>
447-449 <the>
451-457 <Bourbon>
459-465 <dynasty>
468-471 <home>
473-474 <to>
484-490 <Spain’s>
492-496 <great>
498-500 <art>
502-508 <museums>
511-512 <is>
514-516 <the>
518-521 <next>
523-526 <area>
528-533 <worthy>
535-536 <of>
538-546 <exploring>
549-551 <for>
561-563 <art>
565-570 <lovers>
573-578 <though>
581-582 <it>
584-586 <may>
588-591 <very>
593-596 <well>
598-599 <be>
601-603 <the>
605-609 <first>
613-619 <Spain’s>
621-626 <Golden>
636-643 <Triangle>
645-646 <of>
648-650 <Art>
652-653 <is>
655-666 <concentrated>
668-669 <on>
671-673 <the>
675-681 <elegant>
683-685 <but>
687-690 <busy>
692-696 <Paseo>
698-700 <del>
710-714 <Prado>
717-723 <between>
725-730 <Puerta>
732-734 <del>
736-738 <Sol>
740-742 <and>
744-749 <Retiro>
751-754 <Park>
757-761 <Those>
763-766 <with>
768-771 <more>
773-776 <time>
778-779 <in>
789-794 <Madrid>
797-802 <either>
804-809 <before>
811-812 <or>
814-818 <after>
820-823 <side>
825-829 <trips>
831-832 <to>
834-836 <the>
838-842 <great>
844-848 <towns>
850-851 <of>
861-867 <Castile>
870-874 <might>
876-882 <explore>
884-886 <the>
888-893 <barrio>
895-896 <of>
898-906 <Salamanca>
909-912 <take>
914-915 <in>
917-917 <a>
919-927 <bullfight>
930-931 <or>
941-945 <visit>
947-949 <one>
951-952 <or>
954-957 <more>
959-960 <of>
962-964 <the>
966-972 <smaller>
975-978 <more>
980-987 <personal>
989-995 <museums>
998-1001 <only>
1003-1003 <a>
1005-1008 <ride>
1018-1021 <from>
1023-1025 <the>
1027-1032 <Puerta>
1034-1036 <del>
1038-1040 <Sol>
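Note that this listing reports Madrid as 66-71, while the PLACE tag in the question has end="72". The code subtracts 1 from $+[0], so its end values are inclusive, whereas the annotation appears to use an exclusive end. The following minimal sketch shows both conventions side by side; the helper name span_of and the short sample string are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start, exclusive end) of the first occurrence
# of $word in $text, or an empty list if $word does not occur.
sub span_of {
    my ($text, $word) = @_;
    return unless $text =~ /\Q$word\E/;
    # $-[0] is the offset where the match starts,
    # $+[0] is the offset just past the end of the match.
    return ($-[0], $+[0]);
}

my $text = "parts of Madrid of greatest";   # made-up fragment, not the full file
my ($start, $end) = span_of($text, "Madrid");

print "exclusive end: $start-$end\n";
print "inclusive end: $start-", $end - 1, "\n";

# With an exclusive end, the length passed to substr is simply end - start.
print substr($text, $start, $end - $start), "\n";
```

If the goal is to reproduce the offsets stored in the PLACE tags, using $+[0] unchanged is probably the closer match.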