Question

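I'm trying to do pattern matching in Perl: I check for non-whitespace characters at the start of each line read from a file and return the first matched word.

The problem is that sometimes the word ends with ':' and sometimes it doesn't.

For example:

Say I have a file with the following content, and sometimes with the alternate content instead. The file is populated automatically.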

some0 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file
some3 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file

Alternate content:

some1: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file
some3: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file

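Now I want to extract only the first word of each line from this file. If the file has the alternate content, I still want just the first word, without the trailing ':'.

I only need the pattern-matching part. This is what I have so far.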

foreach ... 
    if  (/^(\S+):/) { 
        print $1;
    }

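/* With the pattern above I get the first word from the alternate content, i.e. some1 and some3 without the trailing ':', but with the original content $1 does not match. */

But if I use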

foreach ... 
    if  (/^(\S+)/) { 
        print $1;
    }

/* Now the alternate content will not match. */

Any hints?

Answer 1

Do a greedy match that excludes whitespace and colons:

while (<DATA>) {
    if  (/^([^:\s]+)/) { 
        print "$1\n";
    }
}

__DATA__
some0 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file
some3 Loren Posem:is some color::and some foo bar with 1023:4632
      some more content added to the file
Alternate content:

some1: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file
some3: Loren Posem:is some will be different with some number 5423:32
      some more content added to the file
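
To see why the negated character class handles both layouts, here is a minimal sketch. The helper name `first_word` is made up for illustration and is not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the leading word of a line, or undef when the
# line starts with whitespace or a colon.
sub first_word {
    my ($line) = @_;
    # [^:\s]+ stops at the first colon or whitespace character, so a
    # trailing ':' is never part of the capture.
    return $line =~ /^([^:\s]+)/ ? $1 : undef;
}

for my $line (
    "some0 Loren Posem:is some color",
    "some1: Loren Posem:is some will be different",
    "      some more content added to the file",
) {
    my $word = first_word($line);
    print defined $word ? "$word\n" : "(no leading word)\n";
}
```

The first two lines yield `some0` and `some1`; the indented continuation line has no word at position 0, so the match fails and nothing is captured.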

Answer 2

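If you're working with a large amount of data, splitting (and setting split's LIMIT) to get the first word can offer a performance advantage over a capturing regex in this case (about 14% in the benchmark below):

foreach ... 
    if ( my $firstWord = ( split /[:\s]/, $_, 2 )[0] ) {
        print $firstWord, "\n";
    }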
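
A LIMIT of 2 tells split to stop after the first separator, so the rest of the line is left as one unsplit field. A small sketch of that behaviour follows; the helper name `first_word_split` is an assumption for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split-based approach.
sub first_word_split {
    my ($line) = @_;
    # LIMIT 2: at most two fields are produced, the first word and the
    # untouched remainder of the line.
    my ($first) = split /[:\s]/, $line, 2;
    return $first;
}

for my $line (
    "some3 Loren Posem:is some color",
    "some3: Loren Posem:is some will be different",
    "      some more content added to the file",
) {
    my $word = first_word_split($line);
    # A line that starts with a separator produces an empty first field.
    print "$word\n" if defined $word && length $word;
}
```

Both `some3` lines print `some3`; the indented line yields an empty first field and is skipped.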

Benchmark:

use strict;
use warnings;
use Benchmark qw/cmpthese/;

my @data = <DATA>;

sub _split {
    for (@data) {
        if ( my $firstWord = ( split /[:\s]/, $_, 2 )[0] ) {
            #print $firstWord, "\n";
        }
    }
}

sub _regex {
    for (@data) {
        if ( my ($firstWord) = /^([^:\s]+)/ ) {
            #print $firstWord, "\n";
        }
    }
}

cmpthese(
    -5,
    {
        _split => sub { _split() },
        _regex => sub { _regex() }
    }
);

__DATA__
some0 Loren Posem:is some color::and some foo bar with 1023:4632
some3 Loren Posem:is some color::and some foo bar with 1023:4632
some1: Loren Posem:is some will be different with some number 5423:3
some3: Loren Posem:is some will be different with some number 5423:32

Output (faster is lower in the table):

           Rate _regex _split
_regex 396843/s     --   -12%
_split 450546/s    14%     --

However, you may find the regex more readable.

Hope this helps!
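
Before choosing between the two on speed, it may be worth confirming that they return the same word for your data. The sketch below is one way to check; `word_by_regex` and `word_by_split` are hypothetical names. It tests the split result with `length` rather than plain truth, because a first word of "0" is false in Perl and would be skipped by `if ( my $firstWord = ... )`.

```perl
use strict;
use warnings;

# Hypothetical helpers that normalise "no word" to undef for both approaches.
sub word_by_regex {
    my ($line) = @_;
    my ($word) = $line =~ /^([^:\s]+)/;
    return $word;    # undef when the line does not start with a word
}

sub word_by_split {
    my ($line) = @_;
    my $word = ( split /[:\s]/, $line, 2 )[0];
    # split yields an empty first field for lines starting with a separator
    return ( defined $word && length $word ) ? $word : undef;
}

for my $line ( "some0 Loren", "some1: Loren", "      indented", "0: zero" ) {
    my ( $r, $s ) = ( word_by_regex($line), word_by_split($line) );
    my $same = ( defined $r ? $r : '' ) eq ( defined $s ? $s : '' );
    print $same ? "same\n" : "different\n";
}
```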

Perl: return the matched string, ignoring the trailing delimiter if present