Perl: return the matched string, ignoring a trailing delimiter if present

Asked: 2014-03-09 06:38:23

Tags: regex perl pattern-matching

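I'm trying to do pattern matching in Perl: for each line I read from a file, I check for non-whitespace characters at the start of the line and return the first matched word.

The problem is that the word sometimes ends with ':' and sometimes doesn't.

For example:

Suppose I have a file with the following content, and sometimes with the alternate content. The file is populated automatically.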

some0 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file
some3 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file

Alternate content:

some1: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file
some3: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file

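Now I want to extract only the first word from this file. If the file has the alternate content, I still want only the first word, without the trailing ':'.

I only need the pattern-matching part here. This is what I have so far: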

foreach ... 
    if  (/^(\S+):/) { 
        print $1;
    }

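With the pattern above I get the first word from the alternate content, i.e. some1 and some3, without the trailing ':'. But with the original content the pattern does not match, so $1 is not set.

But if I use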

foreach ... 
    if  (/^(\S+)/) { 
        print $1;
    }

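Now the alternate content does not come out as wanted: \S+ also consumes the ':', so $1 is "some1:" rather than "some1".

Any hints here?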

2 answers:

Answer 0: (score: 2)

Use a greedy match on a character class that excludes both whitespace and colons:

while (<DATA>) {
    if  (/^([^:\s]+)/) { 
        print "$1\n";
    }
}

__DATA__
some0 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file
some3 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file
Alternate content:

some1: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file
some3: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file
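
As a small sketch of how that character class behaves on both line formats, the pattern can be wrapped in a helper. The sub name `first_word` and the sample lines are made up for illustration and are not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the first word
# of a line without any trailing ':', or undef if the line does not start
# with a word (for example, an indented continuation line).
sub first_word {
    my ($line) = @_;
    return $line =~ /^([^:\s]+)/ ? $1 : undef;
}

my @lines = (
    "some0 Loren Posem:is some color",
    "some1: Loren Posem:is some will be different",
    "      some more content added to the file",
);

for my $line (@lines) {
    my $word = first_word($line);
    print defined $word ? "$word\n" : "(no match)\n";
}
```

Because `[^:\s]+` stops at the first colon or whitespace character, whichever comes first, the same pattern covers lines with and without the trailing ':'.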

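Answer 1: (score: 1)

If you are processing a lot of data, using split (with its LIMIT argument set) to get the first word can be faster than a capturing regex in this case; the benchmark below shows a difference of roughly 14%: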

foreach ... 
    if ( my $firstWord = ( split /[:\s]/, $_, 2 )[0] ) {
        print $firstWord, "\n";
    }
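
A minimal sketch of the split approach as a standalone helper follows. The sub name `first_word_split` is hypothetical, and the `length` check is an addition that is not in the original answer: the plain truth test in the `if` would also skip a first word of "0", since Perl treats that string as false.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes the first field of
# a split on ':' or whitespace. LIMIT 2 makes split stop after the first
# separator, so the rest of the line is not broken into further fields.
sub first_word_split {
    my ($line) = @_;
    my $first = ( split /[:\s]/, $line, 2 )[0];
    # A line that starts with a separator yields an empty first field;
    # treat that (and an empty line) as "no word".
    return ( defined $first && length $first ) ? $first : undef;
}

for my $line ( "some0 Loren Posem:is", "some1: Loren Posem:is", "      some more" ) {
    my $word = first_word_split($line);
    print defined $word ? "$word\n" : "(no match)\n";
}
```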

Benchmark

use strict;
use warnings;
use Benchmark qw/cmpthese/;

my @data = <DATA>;

sub _split {
    for (@data) {
        if ( my $firstWord = ( split /[:\s]/, $_, 2 )[0] ) {
            #print $firstWord, "\n";
        }
    }
}

sub _regex {
    for (@data) {
        if ( my ($firstWord) = /^([^:\s]+)/ ) {
            #print $firstWord, "\n";
        }
    }
}

cmpthese(
    -5,
    {
        _split => sub { _split() },
        _regex => sub { _regex() }
    }
);

__DATA__
some0 Loren Posem:is some color::and some foo bar with 1023:4632
some3 Loren Posem:is some color::and some foo bar with 1023:4632
some1: Loren Posem:is some will be different with some number 5423:3
some3: Loren Posem:is some will be different with some number 5423:32

Output (a higher rate is faster; the faster sub is listed last):

           Rate _regex _split
_regex 396843/s     --   -12%
_split 450546/s    14%     --

However, you may find the regex more readable.

Hope this helps!