Question

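For a file filter, I want to use an array of words and check whether a line matches any of them.

I already have a fairly simple approach (only the essential matching part is shown):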

# check if any of the @words is found in $term

@words= qw/one
two
three/;
$term= "too for the show";

# the following looks very C like

$size= @words;
$found= 0;

for ($i= 0; $i<$size && !$found; $i++) {
   $found|= $term=~ /$words[$i]/;
}

printf "found= %d\n", $found;

Having seen a lot of cryptic syntax and solutions in Perl, I wonder whether (or rather, which) more compact ways of writing this exist.

Answer 1

Build one regular expression from all the words and match only once:

#!/usr/bin/perl
use warnings;
use strict;

my @words = qw( one two three );

my $regex = join '|', map quotemeta, @words;

for my $term ('too for the show', 'five four three', 'bones') {
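    # A failed match returns an empty string, which would trigger a
    # "not numeric" warning in printf's %d, so force a plain 0 or 1.
    my $found = $term =~ $regex ? 1 : 0;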
    printf "found = %d\n", $found;
}

Matching against /\b(?:$regex)\b/ instead would prevent bones from matching one.
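As a minimal sketch of that word-boundary variant (the variable name $alternation and the use of qr// to precompile the pattern are my own choices, not part of the answer above), it could look like this:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my @words = qw( one two three );

# Escape each word, join the words into one alternation,
# and compile the anchored pattern once with qr//.
my $alternation = join '|', map quotemeta, @words;
my $regex       = qr/\b(?:$alternation)\b/;

for my $term ('too for the show', 'five four three', 'bones') {
    # The ternary forces a plain 0 or 1 for printf's %d.
    my $found = $term =~ $regex ? 1 : 0;
    printf "%-20s found = %d\n", $term, $found;
}
```

With these three sample terms, only 'five four three' reports found = 1; 'bones' no longer matches because one is not a whole word there.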

Answer 2

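Use Regexp::Assemble to turn the search into a single regular expression. That way each string is scanned only once, which makes it more efficient when there are many lines.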

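Regexp::Assemble is better than doing this by hand. It has a full API for whatever you might want to do with such a regex, it handles edge cases, and it can intelligently compile the words into a more efficient regex.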

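For example, this program produces (?^:\b(?:t(?:hree|wo)|one)\b), which results in less backtracking. This becomes significant as your word list grows. Recent versions of Perl, roughly 5.14 and later, will do this for you.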

use strict;
use warnings;
use v5.10;

use Regexp::Assemble;

# Wrap each word in \b (word break) so only the full word is
# matched. 'one' will match 'money' but '\bone\b' won't.
my @words= qw(
    \bone\b
    \btwo\b
    \bthree\b
);

# These lines simulate reading from a file.
my @lines = (
    "won for the money\n",
    "two for the show\n",
    "three to get ready\n",
    "now go cat go!\n"
);

# Assemble all the words into one regex.
my $ra = Regexp::Assemble->new;
$ra->add(@words);

for my $line (@lines) {
    print $line if $line =~ $ra;
}

Also note the foreach-style loop used to iterate over an array, and the use of a statement modifier.

Finally, I used \b to make sure only whole words are matched, not substrings such as the one in money.

Answer 3

This is perhaps an overly simple "translation" of the C-style code into Perl.

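Pro: it is compact.
Con: it is not efficient, because grep tests every word even after one has matched (the other answers here are better in that respect).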

@words= qw/one
two
three/;
$term= "too for the show";

my @found = grep { $term =~ /$_/; } @words;

printf "found= %d\n", scalar @found;
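If the only goal is a yes/no answer, a short-circuiting variant is possible. The following is a sketch under my own assumptions, not part of the answer above: it uses any from List::Util (available in List::Util 1.33 and later) and adds \Q...\E so that words are matched literally.

```perl
use strict;
use warnings;
use List::Util qw(any);   # any() needs List::Util 1.33 or newer

my @words = qw( one two three );
my $term  = "too for the show";

# any() returns true as soon as the block is true for one word,
# so the remaining words are not tested.
# \Q...\E escapes regex metacharacters in the word.
my $found = (any { $term =~ /\Q$_\E/ } @words) ? 1 : 0;

printf "found= %d\n", $found;
```

Unlike the grep version, this reports only whether a word was found, not how many words matched.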

Perl: searching a string for occurrences of words
