Regex to match an arbitrary number of tags in an XML-style document

Time: 2016-02-29 16:02:46

Tags: regex xml perl

I have an XML-style document that looks like this:

<sentence id="2339">
    <text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
    <aspectTerms>
        <aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
        <aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
    </aspectTerms>
</sentence>
<sentence id="812">
    <text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
<sentence id="1316">
    <text>The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.</text>
    <aspectTerms>
        <aspectTerm term="service center" polarity="negative" from="27" to="41"/>
        <aspectTerm term="&quot;sales&quot; team" polarity="negative" from="109" to="121"/>
        <aspectTerm term="tech guy" polarity="neutral" from="4" to="12"/>
    </aspectTerms>
</sentence>

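I want a regex that matches (1) the sentence and (2) the polarity of every aspect term that belongs to that sentence. In other words, a list like this: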

[
    [
        "I charge it at night and skip taking the cord with me because of the good battery life.",
        "neutral",
        "positive"
    ],
    [
        "I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer."
    ], 
    [
        "The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.",
        "negative",
        "negative",
        "neutral"
    ]
]

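My problem is that I can only match the last polarity among the aspect terms of each sentence. I know this has something to do with repeating my capture group, but so far no combination of symbols has worked for me.

Here is my current regex: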

/<sentence .*?>.*?<text>(.+?)<\/text>.*?(?:<aspectTerm.*?polarity="(.+?)".*?)*?<\/sentence>/gs

(I am using this regex in Perl.)
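The symptom described above can be reproduced in isolation. The following is a minimal sketch on a cut-down, hypothetical input string (not the full document): a capture group placed inside a repeated group is overwritten on each iteration, so only the last polarity survives, whereas a global match in list context returns one capture per match.

```perl
use strict;
use warnings;

# Hypothetical, cut-down input: two aspectTerm tags with only a polarity attribute.
my $xml = '<aspectTerm polarity="neutral"/><aspectTerm polarity="positive"/>';

# The capture group sits inside a repeated group, so each iteration
# overwrites it; only the value from the final iteration is returned.
my ($last) = $xml =~ /^(?:<aspectTerm polarity="(.+?)"\/>)+$/;

# A global match in list context returns the capture from every match.
my @all = $xml =~ /<aspectTerm polarity="(.+?)"\/>/g;

print "repeated group: $last\n";
print "global match:   @all\n";
```

This suggests that a single pattern with a repeated group cannot return a variable number of polarities; the polarities have to be collected by a separate global match or by a parser.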

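2 answers:

Answer 0: (score: 4)

Use a parser. By doing so, you get access to xpath, which is very much like regex but "context aware" - it understands the structure of XML, which means that a lot of the problems regex can cause go away.

Something like this (I will leave the formatting details aside - but your example above looks as though you could output a JSON array and get the desired result):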

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

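# Parse the XML that follows __DATA__ at the end of this script.
my $twig = XML::Twig -> parse ( \*DATA );

# For each <sentence>, print its text and the polarity of every aspectTerm below it.
foreach my $sentence ( $twig -> get_xpath('//sentence') ) {
    print "Text:", $sentence -> text,"\n";
    print "Polarities:", join( ",", map { $_ -> att('polarity')} $sentence -> get_xpath('.//aspectTerm/')),"\n";
}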

__DATA__
<xml>
<sentence id="2339">
    <text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
    <aspectTerms>
        <aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
        <aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
    </aspectTerms>
</sentence>
<sentence id="812">
    <text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
<sentence id="1316">
    <text>The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.</text>
    <aspectTerms>
        <aspectTerm term="service center" polarity="negative" from="27" to="41"/>
        <aspectTerm term="&quot;sales&quot; team" polarity="negative" from="109" to="121"/>
        <aspectTerm term="tech guy" polarity="neutral" from="4" to="12"/>
    </aspectTerms>
</sentence>
</xml>

This prints:

Text:I charge it at night and skip taking the cord with me because of the good battery life.
Polarities:neutral,positive
Text:I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.
Polarities:
Text:The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the "sales" team, which is the retail shop which I bought my netbook from.
Polarities:negative,negative,neutral

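Answer 1: (score: 0)

In general, XML cannot be parsed correctly with regular expressions unless the data is well-behaved, consistent, and uses a simple subset of the XML specification. It is always much better to use a dedicated XML parser module such as XML::Twig or XML::LibXML. The resulting program is usually much easier to read, especially once you are used to the XML DOM specification.

LibXML has binding libraries for many languages, including Ruby, Python, and PHP as well as Perl, so it is widely supported.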

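You don't say how your XML data is wrapped. An XML document may have only one root node, so I have imagined that it is enclosed in <root> tags.

This program uses XML::LibXML to process your data and produce the structure I think you want. It expects the path to the input XML file as a parameter on the command line.

I have used Data::Dump to display the final contents of @data, which corresponds to the expected output in your question.

use strict;
use warnings 'all';

use XML::LibXML;

my $dom = XML::LibXML->load_xml(location => shift);

my @data;

for my $sentence ( $dom->findnodes('/root/sentence') ) {
    push @data, [
        $sentence->findvalue('text'),
        map $_->getValue, $sentence->findnodes('aspectTerms/aspectTerm/@polarity')
    ];
}

use Data::Dump;
dd \@data;

Output: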

[
  [
    "I charge it at night and skip taking the cord with me because of the good battery life.",
    "neutral",
    "positive",
  ],
  [
    "I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.",
  ],
  [
    "The tech guy then said the service center does not do 1-to-1 exchange and I have to direct my concern to the \"sales\" team, which is the retail shop which I bought my netbook from.",
    "negative",
    "negative",
    "neutral",
  ],
]
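For completeness, if a parser module cannot be installed and the input really is as regular as the sample, a two-pass regex is one possible workaround. The sketch below is not taken from either answer: `extract_sentences` is a hypothetical helper name, and it assumes no nested <sentence> elements, double-quoted attributes, and no comments or CDATA sections. It also does not decode entities such as &quot;.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one array ref per <sentence>, holding the text
# followed by every polarity. Only suitable for well-behaved input (see above).
sub extract_sentences {
    my ($xml) = @_;
    my @data;

    # Pass 1: isolate the body of each <sentence> element.
    while ( $xml =~ m{<sentence\b[^>]*>(.*?)</sentence>}gs ) {
        my $body = $1;

        # Skip any sentence that has no <text> element.
        my ($text) = $body =~ m{<text>(.*?)</text>}s or next;

        # Pass 2: a global match in list context collects every polarity.
        # The \b after "aspectTerm" keeps <aspectTerms> from matching.
        my @polarities = $body =~ m{<aspectTerm\b[^>]*?\bpolarity="([^"]*)"}g;

        push @data, [ $text, @polarities ];
    }
    return \@data;
}

# Two sentences copied from the question, used as sample input.
my $sample = <<'XML';
<sentence id="2339">
    <text>I charge it at night and skip taking the cord with me because of the good battery life.</text>
    <aspectTerms>
        <aspectTerm term="cord" polarity="neutral" from="41" to="45"/>
        <aspectTerm term="battery life" polarity="positive" from="74" to="86"/>
    </aspectTerms>
</sentence>
<sentence id="812">
    <text>I bought a HP Pavilion DV4-1222nr laptop and have had so many problems with the computer.</text>
</sentence>
XML

for my $row ( @{ extract_sentences($sample) } ) {
    print join( ' | ', @$row ), "\n";
}
```

Splitting the work into an outer match per sentence and an inner global match per polarity avoids the repeated capture group entirely, but it remains fragile compared with the parser-based answers above.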
