Why does this regular expression return more groups than it should?

Date: 2014-11-01 15:17:07

Tags: regex perl

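I was working through a popular book on regular expressions and came across this regex, which is supposed to pick out the values from a line of comma-separated values.

It should handle double quotes, with "" treated as an escaped double quote (the sequence "" is allowed inside another pair of double quotes).

Here is the Perl script I wrote for it: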

$str = "Ten Thousand,10000, 2710 ,,\"10,000\",\"It's \"\"10 Grand\"\", baby\",10K";
#$regex = qr"(?:^|,)(?:\"((?:[^\"]|\"\")+)\"|([^\",]+))*";
$regex = qr!
        (?: ^|,)
        (?: 
            "
                ( (?: [^"] | "" )+ )
            "
            |
            ( [^",]+ )
        )
    !x;

@matches = ($str =~ m#$regex#g);
print "\nString : $str\n";
if (scalar(@matches) > 0 ) {
    print "\nMatches\n";
    print "\nNumber of groups: ", scalar(@matches), "\n";
    for ($i=0; $i < scalar(@matches); $i++) {
        print "\nGroup $i - |$matches[$i]|\n";
    }
}
else {
    print "\nDoesnt match\n";
}

This is the output I expected (and, as far as I can tell, what the author expects too):

String : Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K
   Matches
   Number of groups: 7
   Group 0 - |Ten Thousand|
   Group 1 - |10000|
   Group 2 - | 2710 |
   Group 3 - |10,000|
   Group 4 - ||
   Group 5 - |It's ""10 Grand"", baby|
   Group 6 - |10K|

This is the output I actually get:

String : Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K
   Matches
   Number of groups: 12
   Group 0 - ||
   Group 1 - |Ten Thousand|
   Group 2 - ||
   Group 3 - |10000|
   Group 4 - ||
   Group 5 - | 2710 |
   Group 6 - |10,000|
   Group 7 - ||
   Group 8 - |It's ""10 Grand"", baby|
   Group 9 - ||
   Group 10 - ||
   Group 11 - |10K|

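Can someone explain why there are empty groups in the actual output (apart from the one before 10,000, which is expected)? I copied the regex straight from the book, so is there something else I am doing wrong?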

TIA

3 Answers:

Answer 0 (score: 2)

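The regex has 2 capturing groups and several non-capturing groups. When you apply the regex to the string, the g modifier tells it to keep matching as many times as it can. In list context, every match returns both capturing groups, including the one that did not take part in the match (it comes back undefined and prints as an empty string). Here the pattern matches 6 times and returns 2 groups each time, for 12 elements in total.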
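To make that counting concrete, here is a minimal sketch (not taken from the book or the original script) that walks the matches one at a time in scalar context and keeps whichever of the two groups actually matched. The variable name @fields is only illustrative.

```perl
use strict;
use warnings;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);

my $regex = qr{
    (?: ^ | , )
    (?:
        " ( (?: [^"] | "" )+ ) "    # group 1: quoted field
        |
        ( [^",]+ )                  # group 2: unquoted field
    )
}x;

my @fields;
while ($str =~ /$regex/g) {
    # Only one of the two groups takes part in any given match;
    # the other one is undef, so keep the defined one.
    push @fields, defined $1 ? $1 : $2;
}

print scalar(@fields), " fields\n";
print "|$_|\n" for @fields;
```

This yields one element per match, so 6 elements rather than 12. Note that the empty field between the two adjacent commas is not among them: both alternatives require at least one character, so this regex never matches an empty field.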

The regular expression:

(?-imsx:!
        (?: ^|,)

        (?:

            "

                ( (?: [^"] | "" )+ )

            "

            |

            ( [^",]+ )
        )
    !x)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  !                        '!\n        '
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
                           '\n\n        '
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
                  "          '\n\n            "\n\n                '
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
      (?:                      group, but do not capture (1 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
                                 ' '
----------------------------------------------------------------------
        [^"]                     any character except: '"'
----------------------------------------------------------------------
                                 ' '
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
         ""                      ' "" '
----------------------------------------------------------------------
      )+                       end of grouping
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
                  "          '\n\n            "\n\n            '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
                             '\n\n            '
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
      [^",]+                   any character except: '"', ',' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
                             '\n        '
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
       !x                  '\n    !x'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

TLP has already mentioned that you could also use the Text::CSV module. Here is an example of that.

#!/usr/bin/perl

use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper;

my $csv = Text::CSV_XS->new({binary => 1, eol => $/, allow_whitespace => 1});

while (my $row = $csv->getline (*DATA)) {
    print Dumper $row;
}

__DATA__
Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K;

Output:

$VAR1 = [
          'Ten Thousand',
          '10000',
          '2710',
          '',
          '10,000',
          'It\'s "10 Grand", baby',
          '10K;'
        ];

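Answer 1 (score: 1)

You may find the Perl 5 core module Text::ParseWords useful. It does all of this in just a few lines of code. Also note that you can use q() and qq() in place of single and double quotes, so that you do not have to escape the quotes inside the string. Like most of Perl's quote-like operators, they also accept almost any punctuation character as the delimiter.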

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);
my @words = quotewords(',', 1, $str);
print Dumper \@words;

Output:

$VAR1 = [
          'Ten Thousand',
          '10000',
          ' 2710 ',
          '',
          '"10,000"',
          '"It\'s ""10 Grand"", baby"',
          '10K'
        ];

(Note: the escaped single quote in It\'s comes from Data::Dumper.)

If your data is proper CSV data, you can use Text::CSV instead.

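Answer 2 (score: 1)

I agree with @RonBergin. Capture groups are always kept in the returned list. So if you have 2 capture groups and 6 matches, you get an array of 12 elements.

It looks like you want to trim the values and merge the two capture groups into one. One way to do that is a branch reset, which makes the alternatives share the same group number.

I did not want to change your regex much, but the example below uses a branch reset along with a few additions that make it more robust.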

 # (?:^|,)(?|\s*"((?:[^"]|"")*)"\s*|\s*([^",]*?)\s*)(?=,|$)

 (?: ^ | , )                     # BOL or comma
 (?|                             # Start Branch Reset
      \s* 
      "
      (                               # (1 start), Quoted content
           (?: [^"] | "" )*
      )                               # (1 end)
      "
      \s* 
   |  
      \s*                             # Whitespace trim
      ( [^",]*? )                     # (1), Optional Non-quoted content
      \s*                             # Whitespace trim
 )                               # End Branch Reset
 (?= , | $ )                     # Lookahead for comma or EOL
                                 # (needed because content is optional)
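
As a rough sketch of how the branch-reset pattern above might be used (assuming Perl 5.10 or later, which introduced `(?|...)`), the script below applies it to the sample string from the question. The variable name @fields is only illustrative.

```perl
use strict;
use warnings;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);

my $regex = qr{
    (?: ^ | , )
    (?|
        \s* " ( (?: [^"] | "" )* ) " \s*    # group 1: quoted content
      |
        \s* ( [^",]*? ) \s*                 # also group 1: optional unquoted content
    )
    (?= , | $ )
}x;

# With a single group number, list context returns one element per match.
my @fields = ($str =~ /$regex/g);

print scalar(@fields), " fields\n";
print "|$_|\n" for @fields;
```

Because both alternatives capture into group 1, the list holds 7 elements, one per field. The unquoted value is trimmed to 2710, the empty field is kept as an empty string, and the doubled quotes inside the quoted field are left as they are.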