I have an XML-style document like the following:
<sentence id="2339">
<text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
<aspectTerms>
<aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
<aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
</aspectTerms>
</sentence>
<sentence id="812">
<text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
<sentence id="1316">
<text>The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.</text>
<aspectTerms>
<aspectTerm term="service center" polarity="negative" from="27" to="41"/>
<aspectTerm term="&quot;sales&quot; team" polarity="negative" from="109" to="121"/>
<aspectTerm term="tech guy" polarity="neutral" from="4" to="12"/>
</aspectTerms>
</sentence>
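I want a regex that matches (1) the sentence and (2) the polarity of every aspect term belonging to that sentence. In other words, a list like this: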
[
[
"I charge it at night and skip taking the cord with me because of the good battery life.",
"neutral",
"positive"
],
[
"I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer."
],
[
"The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the \"sales\" team, which is the retail shop which I bought my netbook from.",
"negative",
"negative",
"neutral"
]
]
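My problem is that I can only match the last polarity among each sentence's aspect terms. I know this has to do with repeating my capture group, but so far no combination of symbols has worked for me.
Here is my current regex: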
/<sentence .*?>.*?<text>(.+?)<\/text>.*?(?:<aspectTerm.*?polarity="(.+?)".*?)*?<\/sentence>/gs
(I am using this regex in Perl.)
Answer 0 (score: 4)
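Use a parser. Doing so gives you access to xpath, which is a lot like regex but context-aware: it understands the structure of XML, so many of the problems that regex can cause simply go away.
Something like this (I will leave the formatting details aside, but judging by your example above you could output a JSON array and get the desired result):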
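#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Parse the XML that follows __DATA__ at the end of this script.
my $twig = XML::Twig->parse( \*DATA );

# Visit every <sentence> element, wherever it sits in the document.
foreach my $sentence ( $twig->get_xpath('//sentence') ) {
    print "Text:", $sentence->text, "\n";

    # Collect the polarity attribute of each <aspectTerm> below this sentence.
    print "Polarities:",
        join( ",", map { $_->att('polarity') } $sentence->get_xpath('.//aspectTerm') ),
        "\n";
}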
__DATA__
<xml>
<sentence id="2339">
<text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
<aspectTerms>
<aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
<aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
</aspectTerms>
</sentence>
<sentence id="812">
<text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
<sentence id="1316">
<text>The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.</text>
<aspectTerms>
<aspectTerm term="service center" polarity="negative" from="27" to="41"/>
<aspectTerm term="&quot;sales&quot; team" polarity="negative" from="109" to="121"/>
<aspectTerm term="tech guy" polarity="neutral" from="4" to="12"/>
</aspectTerms>
</sentence>
</xml>
This prints:
Text:I charge it at night and skip taking the cord with me because of the good battery life.
Polarities:neutral,positive
Text:I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.
Polarities:
Text:The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.
Polarities:negative,negative,neutral
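The loop above prints plain text. To get the nested list from the question as JSON, the same traversal can collect array references and hand them to a JSON encoder. What follows is only a minimal sketch: it assumes the core JSON::PP module is acceptable, uses just the first two sentences of the sample data, and the names $xml, @data and $json are illustrative rather than anything from the original answer.

```perl
use strict;
use warnings;
use XML::Twig;
use JSON::PP;

# First two sentences of the sample data, wrapped in a single root element.
my $xml = <<'XML';
<xml>
<sentence id="2339">
<text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
<aspectTerms>
<aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
<aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
</aspectTerms>
</sentence>
<sentence id="812">
<text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
</xml>
XML

my $twig = XML::Twig->new->parse($xml);

my @data;
foreach my $sentence ( $twig->get_xpath('//sentence') ) {
    # One inner array per sentence: the text first, then each polarity.
    push @data, [
        $sentence->first_child_text('text'),
        map { $_->att('polarity') } $sentence->get_xpath('.//aspectTerm'),
    ];
}

# Encode the nested structure as a pretty-printed JSON array of arrays.
my $json = JSON::PP->new->pretty->encode( \@data );
print $json;
```

A sentence with no aspect terms simply yields a one-element inner array, which matches the shape asked for in the question.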
Answer 1 (score: 0)
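In general, XML cannot be parsed correctly with regular expressions unless the data is well-behaved, consistent, and uses a simple subset of the XML specification. It is always much better to use a dedicated XML parser module such as XML::Twig or XML::LibXML. The resulting program is usually easier to read as well, especially once you are used to the XML DOM specification.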
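That said, if the data really is as regular as the sample in the question, a regex can cope when the work is split into two passes. A capture group inside a quantifier, such as (?:...polarity="(.+?)"...)*?, keeps only the text from its final iteration, which is why only the last polarity came back. The sketch below first matches each sentence block, then runs a separate global match for the polarities inside that block. It assumes the tag layout and double-quoted attributes shown in the sample and no ">" inside attribute values; $xml, $rest and @data are illustrative names, and only the first two sentences of the sample are included.

```perl
use strict;
use warnings;

# First two sentences of the sample data from the question.
my $xml = <<'XML';
<sentence id="2339">
<text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
<aspectTerms>
<aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
<aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
</aspectTerms>
</sentence>
<sentence id="812">
<text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
XML

my @data;

# Pass 1: capture each sentence's text and the remainder of its block.
while ( $xml =~ m{<sentence\b[^>]*>.*?<text>(.+?)</text>(.*?)</sentence>}gs ) {
    my ( $text, $rest ) = ( $1, $2 );

    # Pass 2: a global match in list context returns every capture,
    # not just the one from the last iteration.
    my @polarities = $rest =~ /<aspectTerm\b[^>]*?\bpolarity="([^"]*)"/g;

    push @data, [ $text, @polarities ];
}

# Show one line per sentence: the text followed by its polarities.
print join( " | ", @$_ ), "\n" for @data;
```

This stays fragile in all the ways described above: a reordered attribute, single quotes, or a nested sentence element would break it, whereas a parser would not care.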
LibXML has binding libraries for many languages, including Ruby, Python, and PHP as well as Perl, so it is widely supported.
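You don't say how your XML data is wrapped. An XML document may have only one root node, so I have imagined it to be enclosed in <root> tags.
This program uses XML::LibXML to process your data and build the structure I think you want. It expects the path to the input XML file as a parameter on the command line.
I have used Data::Dump to display the final contents of @data, which correspond to the expected output in your question.
use strict;
use warnings 'all';
use XML::LibXML;

my $dom = XML::LibXML->load_xml(location => shift);
my @data;
for my $sentence ( $dom->findnodes('/root/sentence') ) {
    push @data, [
        $sentence->findvalue('text'),
        map $_->getValue, $sentence->findnodes('aspectTerms/aspectTerm/@polarity')
    ];
}
use Data::Dump;
dd \@data;
Output: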
[
[
"I charge it at night and skip taking the cord with me because of the good battery life.",
"neutral",
"positive",
],
[
"I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.",
],
[
"The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the \"sales\" team, which is the retail shop which I bought my netbook from.",
"negative",
"negative",
"neutral",
],
]
REPORT ZZZ.
CLASS lcl_main DEFINITION FINAL CREATE PRIVATE.
PUBLIC SECTION.
CLASS-METHODS:
main,
reject.
PRIVATE SECTION.
TYPES:
BEGIN OF t_num,
num TYPE string,
END OF t_num.
CLASS-DATA:
pa0013_01 TYPE t_num,
pa0013_02 TYPE t_num,
pa0013_03 TYPE t_num,
pa0013_04 TYPE t_num,
pa0013_05 TYPE t_num,
pa0013_06 TYPE t_num,
pa0000_01 TYPE t_num,
pa0000_02 TYPE t_num,
pa0000_03 TYPE t_num,
pa0000_04 TYPE t_num,
pa0000_05 TYPE t_num,
pa0000_06 TYPE t_num,
pa0005 TYPE t_num.
ENDCLASS.
CLASS lcl_main IMPLEMENTATION.
METHOD main.
DATA(lt_pa0013) = VALUE string_table(
( pa0013_01-num ) ( pa0013_02-num ) ( pa0013_03-num )
( pa0013_04-num ) ( pa0013_05-num ) ( pa0013_06-num )
).
DATA(lt_pa0000) = VALUE string_table(
( pa0000_01-num ) ( pa0000_02-num ) ( pa0000_03-num )
( pa0000_04-num ) ( pa0000_05-num ) ( pa0000_06-num )
).
DATA: lt_pa0000_hash TYPE SORTED TABLE OF string WITH NON-UNIQUE KEY TABLE_LINE.
DATA(l_flg_empty_rest) = COND #( WHEN pa0005-num <> 0 THEN abap_false ELSE abap_true ).
LOOP AT lt_pa0013 ASSIGNING FIELD-SYMBOL(<fs_pa0013>).
IF <fs_pa0013> IS INITIAL.
l_flg_empty_rest = abap_true.
ENDIF.
IF l_flg_empty_rest = abap_true.
CLEAR <fs_pa0013>.
lt_pa0000[ sy-tabix ] = space.
ENDIF.
ENDLOOP.
lt_pa0000_hash = lt_pa0000.
IF lt_pa0000_hash[ `3` ] IS INITIAL.
reject( ).
ENDIF.
ENDMETHOD.
METHOD reject.
ASSERT 0 = 0.
ENDMETHOD.
ENDCLASS.