Question

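I have an Ispell English word list (nearly 50,000 words), and my homework in Perl is to quickly (say, in under a minute) get the list of all strings that are substrings of some other word. I have tried a solution with two foreach loops comparing all the words, but even with some optimisations it is still too slow. I think the right solution could be some clever use of regular expressions on the array of words. Do you know how to solve this problem quickly (in Perl)?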

Answer 1

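I found a fast solution that finds all these substrings in about 15 seconds on my computer, using a single thread. Basically, for each word I create an array of every possible substring (eliminating substrings that differ from the word only by a trailing "s" or "'s"):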

```perl
#take word and return list of all valid substrings
sub split_to_all_valid_subwords {
    my $word = $_[0];
    my @split_list;
    my ($i, $j);
    for ($i = 0; $i < length($word); ++$i){
        for ($j = 1; $j <= length($word) - $i; ++$j){
            unless
                (
                    ($j == length($word)) or
                    ($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
                    ($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
                )
                {
                push(@split_list, substr($word, $i, $j));
            }
        }
    }
    return @split_list;
}
```

Then I simply build a list of all substring candidates and intersect it with the words:

```perl
my @substring_candidates;

foreach my $word (@words) {
    push( @substring_candidates, split_to_all_valid_subwords($word));
}

#make intersection between substring candidates and words
my %substring_candidates=map{$_ =>1} @substring_candidates;
my %words=map{$_=>1} @words;
my @substrings = grep( $substring_candidates{$_}, @words );
```

Now `@substrings` holds all the words that are substrings of some other word.
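To see the whole approach run end to end on a tiny list, here is a minimal sketch. It is a simplified variant, not the answer's exact code: the helper names `proper_substrings` and `words_inside_others` are made up for this illustration, and the special handling of a trailing "s" or "'s" is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: every substring of $word except the word itself.
sub proper_substrings {
    my ($word) = @_;
    my $len = length $word;
    my @subs;
    for my $start (0 .. $len - 1) {
        for my $size (1 .. $len - $start) {
            next if $size == $len;    # skip the whole word
            push @subs, substr $word, $start, $size;
        }
    }
    return @subs;
}

# Hypothetical helper: the words that occur inside another word of the list.
sub words_inside_others {
    my @words = @_;
    my %candidate;
    for my $word (@words) {
        $candidate{$_} = 1 for proper_substrings($word);
    }
    # Intersection: keep the words that are also substring candidates.
    return grep { $candidate{$_} } @words;
}

my @found = words_inside_others(qw(cat scat at dog hotdog));
print "@found\n";    # prints "cat at dog"
```

The hash lookup is what makes this fast: each word is checked against the candidate set once, instead of being compared with every other word.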

Answer 2

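Perl regular expressions will optimise patterns like `foo|bar|baz` into an Aho-Corasick match, up to a certain limit on the total compiled regex length. Your 50,000 words will probably exceed that length, but they could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)

Alternatively, you could just implement Aho-Corasick in your Perl code; that's kind of fun to do.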
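The first suggestion, alternations split into groups by length, could look roughly like the sketch below. The helper name `contained_words` and its `$chunk_size` parameter are assumptions made for this illustration. Whether Perl actually applies its trie optimisation to a given chunk depends on the Perl version and the pattern size, so treat this as a starting point to benchmark rather than a tuned solution. Each alternation holds words of one length only, so at most one alternative can match at any given offset, and a zero-width lookahead with `/g` then visits every offset.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words that occur inside a longer word.
sub contained_words {
    my ($chunk_size, @words) = @_;

    # Group the words by their exact length.
    my %by_length;
    push @{ $by_length{ length $_ } }, $_ for @words;

    my %found;
    for my $len (sort { $a <=> $b } keys %by_length) {
        my @needles   = @{ $by_length{$len} };
        my @haystacks = grep { length($_) > $len } @words;

        # Keep each compiled pattern small by building it from a chunk.
        while (my @chunk = splice @needles, 0, $chunk_size) {
            my $alternation = join '|', map { quotemeta($_) } @chunk;
            my $regex       = qr/(?=($alternation))/;
            for my $word (@haystacks) {
                # The lookahead consumes nothing, so /g tries every offset.
                $found{$1} = 1 while $word =~ /$regex/g;
            }
        }
    }
    return grep { $found{$_} } @words;
}

print join(' ', contained_words(2, qw(cat scat at dog hotdog))), "\n";
# prints "cat at dog"
```

As written, every length group is scanned against all longer words, so the overall cost still depends heavily on how well the regex engine handles each chunk.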

Answer 3

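Update

Ondra provided a beautiful solution in his answer; I leave this post here as an example of overthinking a problem and of failed optimisation techniques.

My worst case kicks in for a word that doesn't match any other word in the input. In that case it goes quadratic. `OPT_PRESORT` was an attempt to avert the worst case for most words. `OPT_CONSECUTIVE` is a linear-complexity filter that reduces the total number of items in the main part of the algorithm, but it is only a constant factor when considering complexity. However, it is still useful with Ondra's algorithm and saves a few seconds, because building his split list is more expensive than comparing two consecutive words.

I updated the code below so that Ondra's algorithm can be selected as a possible optimisation. Paired with zero threads and the presort optimisation, it yields the maximum performance.

I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. It therefore computes the opposite of ysth's idea, but I took the idea for optimisation #2 from his answer. There are three main optimisations, listed below, which can be deactivated if required.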

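1. Multithreading
The question "Is word A in list L? Is word B in L?" can easily be parallelised.
2. Presorting all the words by length
I create an array that, for each possible length, points to the list of all words of at least that length. For long words this can cut down the number of possible words dramatically, but it trades quite a lot of space, as a word of length n appears in all lists from length 1 to length n.
3. Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar, for example: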
```
Abby
Abby's
```
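Since every word that would match the first word also matches the second, I immediately add the first word to the list of matching words and keep only the second word for further testing. This saved about 30% of the words in my test cases. Because I do this before optimisation No 2, it also saves a lot of space. Another trade-off is that the output will not be sorted.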

The script itself is about 120 lines long; I explain each sub before showing it.

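head

This is just your standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Put the number of processors of your machine in the `PROCESSORS` field. The `OPT_MAX` variable can take the number of words you want to process; however, it is evaluated after the optimisations have taken place, so the easy words will already have been caught by the `OPT_CONSECUTIVE` optimisation. Adding anything there will make the script seemingly slower. `$|++` makes sure that status updates are shown immediately. I `exit` after `main` has executed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use threads;
$|=1;

use constant PROCESSORS      => 0;  # (false, n) number of threads
use constant OPT_MAX         => 0;  # (false, n) number of words to check
use constant OPT_PRESORT     => 0;  # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1;  # (true / false) prefilter data while loading
use constant OPT_ONDRA       => 1;  # select the awesome Ondra algorithm
use constant BLABBER_AT      => 10; # (false, n) print progress at n percent

die q(The optimisations Ondra and Presort are mutually exclusive.)
    if OPT_PRESORT and OPT_ONDRA;

exit main();
```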

main

Encapsulates the main logic and does the multithreading. The number reported in `n words will be matched` will be considerably smaller than the number of input words if the input was sorted. After I have selected all the matching words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.

```perl
sub main {
    my @matching;                          # the matching words.
    my @words = load_words(\@matching);    # the words to be searched
    say STDERR 0+@words . " words to be matched";
    my $prepared_words = prepare_words(@words);
    # do the matching, possibly multithreading
    if (PROCESSORS) {
        my @threads = map {threads->new(
                \&test_range, $prepared_words, @words[$$_[0] .. $$_[1]]
            ) } divide(PROCESSORS, OPT_MAX || 0+@words);
        push @matching, $_->join for @threads;
    } else {
        push @matching, test_range(
            $prepared_words,
            @words[0 .. (OPT_MAX || 0+@words)-1]);
    }
    say STDERR 0+@matching . " words matched";
    say for @matching; # print out the matching words.
    0;
}
```

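load_words

This reads all the words from the input files that were supplied as command-line arguments. This is where the `OPT_CONSECUTIVE` optimisation takes place. The `$last` word is put either into the list of matching words or into the list of words to be matched later. `-1 != index($a, $b)` decides whether the word `$b` is a substring of the word `$a`.

```perl
sub load_words {
    my $matching = shift;
    my @words;
    if (OPT_CONSECUTIVE) {
        my $last;
        while (<>) {
            chomp;
            if (defined $last) {
                push @{-1 != index($_, $last) ? $matching : \@words}, $last;
            }
            $last = $_;
        }
        push @words, $last // ();
    } else {
        @words = map {chomp; $_} <>;
    }
    @words;
}
```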
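The consecutive-word filter can be tried in isolation on a few lines. The sketch below is a stand-alone illustration, not part of the script: the helper names `contains` and `prefilter` are made up here, and the input is passed as a list instead of being read from `<>`.

```perl
use strict;
use warnings;

# index($haystack, $needle) returns the offset of the first occurrence,
# or -1 when $needle does not occur in $haystack.
sub contains {
    my ($haystack, $needle) = @_;
    return index($haystack, $needle) != -1;
}

# Hypothetical helper: a line that is contained in the line after it is a
# match straight away; every other line still has to be tested later.
sub prefilter {
    my @lines = @_;
    my (@matching, @remaining);
    for my $i (0 .. $#lines) {
        my $is_last = $i == $#lines;
        if (!$is_last and contains($lines[$i + 1], $lines[$i])) {
            push @matching, $lines[$i];
        } else {
            push @remaining, $lines[$i];
        }
    }
    return (\@matching, \@remaining);
}

my ($matching, $remaining) = prefilter("Abby", "Abby's", "Abel");
print "matched early: @$matching\n";     # matched early: Abby
print "still to test: @$remaining\n";    # still to test: Abby's Abel
```

Each line is compared with its successor only once, which is why this pass is linear in the number of input lines.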

prepare_words

This "blows up" the input words: each word is put into every slot up to its own length, so that slot n holds the words of length n or more. Slot 1 therefore contains all the words. If this optimisation is deselected, the sub is a no-op and passes the input list right through.

```perl
sub prepare_words {
    if (OPT_ONDRA) {
        my $ondra_split = sub { # evil: using $_ as implicit argument
            my @split_list;
            for my $i (0 .. length $_) {
                for my $j (1 .. length($_) - ($i || 1)) {
                    push @split_list, substr $_, $i, $j;
                }
            }
            @split_list;
        };
        return +{map {$_ => 1} map &$ondra_split(), @_};
    } elsif (OPT_PRESORT) {
        my @prepared = ([]);
        for my $w (@_) {
            push @{$prepared[$_]}, $w for 1 .. length $w;
        }
        return \@prepared;
    } else {
        return [@_];
    }
}
```
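To make the slot layout of the presort concrete, here is a small stand-alone sketch of that branch. The helper name `bucket_by_min_length` is an assumption made for this illustration; the loop itself mirrors the `OPT_PRESORT` branch above.

```perl
use strict;
use warnings;

# Hypothetical helper: slot $n holds every word whose length is at least $n,
# so slot 1 holds the whole list.
sub bucket_by_min_length {
    my @words = @_;
    my @slots = ([]);    # slot 0 stays empty
    for my $word (@words) {
        push @{ $slots[$_] }, $word for 1 .. length $word;
    }
    return \@slots;
}

my $slots = bucket_by_min_length(qw(at cat scat));
for my $n (1 .. $#$slots) {
    print "length >= $n: @{ $slots->[$n] }\n";
}
# length >= 1: at cat scat
# length >= 2: at cat scat
# length >= 3: cat scat
# length >= 4: scat
```

A word of length n can only be a proper substring of a longer word, so a lookup in slot n + 1 skips every word that is too short to contain it. The price is the memory for the repeated entries.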

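test

This tests whether the word `$w` is a substring of any of the other words. `$wbl` points to the data structure created by the previous sub: either a flat list of words or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using `index` is faster than using a regex.

```perl
sub test {
    my ($w, $wbl) = @_;
    my $l = length $w;
    if (OPT_PRESORT) {
        for my $try (@{$$wbl[$l + 1]}) {
            return 1 if -1 != index $try, $w;
        }
    } else {
        for my $try (@$wbl) {
            return 1 if $w ne $try and -1 != index $try, $w;
        }
    }
    return 0;
}
```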

divide

This just encapsulates an algorithm that guarantees a fair distribution of `$items` items into `$parcels` buckets. It outputs the bounds of each range of items.

```perl
sub divide {
    my ($parcels, $items) = @_;
    say STDERR "dividing $items items into $parcels parcels.";
    my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
    my @distributions =
        map [
            $_ * $min_size + ($_ < $rest ? $_ : $rest),
            ($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
        ], 0 .. $parcels - 1;
    say STDERR "range division: @$_" for @distributions;
    return @distributions;
}
```
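A worked example may help when reading the bounds arithmetic. The sketch below is a hypothetical rewrite, `split_ranges`, that keeps a running start index instead of the closed-form expressions. For the inputs shown it yields the same bounds I get when tracing `divide` by hand, but it is meant as an illustration and not as a drop-in replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: split the indices 0 .. $items-1 into $parcels
# contiguous [first, last] ranges whose sizes differ by at most one.
sub split_ranges {
    my ($parcels, $items) = @_;
    my $min_size = int($items / $parcels);
    my $rest     = $items % $parcels;    # this many ranges get one extra item
    my @ranges;
    my $first = 0;
    for my $n (0 .. $parcels - 1) {
        my $size = $min_size + ($n < $rest ? 1 : 0);
        push @ranges, [$first, $first + $size - 1];
        $first += $size;
    }
    return @ranges;
}

print "@$_\n" for split_ranges(3, 10);
# 0 3
# 4 6
# 7 9
```

Ten items in three parcels give one range of four items and two of three; the leftover item always goes to the earliest ranges.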

test_range

This calls `test` for each word in the input list and is the sub that is multithreaded. `grep` selects all those elements of the input list for which the code (given as first argument) returns true. It also regularly prints a status message like `thread 2 at 10%`, which makes waiting for completion much easier. This is a psychological optimisation ;-).

```perl
sub test_range {
    my $wbl = shift;
    if (BLABBER_AT) {
        my $range = @_;
        my $step = int($range / 100 * BLABBER_AT) || 1;
        my $i = 0;
        return grep {
            if (0 == ++$i % $step) {
                printf STDERR "... thread %d at %2d%%\n",
                    threads->tid, $i / $step * BLABBER_AT;
            }
            OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
        } @_;
    } else {
        return grep {OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)} @_;
    }
}
```

Invocation

Using bash, I invoked the script like

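```
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
```

where `1000` is the number of lines I wanted to input, `dict/words` is the word list I used, and `/dev/null` is where I want to store the output list, in this case throwing the output away. If the whole file should be read, it can be passed as an argument, like

```
$ perl script.pl input-file >output-file
```

`time` just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed within two minutes in my case, which is actually quite good.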

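Update: more like 6-7 seconds now, with the Ondra + Presort optimisation and no threading.

Further optimisation

Update: overcome by the better algorithm. This section is no longer completely valid.

The multithreading is awful. It allocates quite a lot of memory and isn't exactly fast. That isn't surprising considering the amount of data. I considered using a `Thread::Queue`, but that thing is slow like $@*! and is therefore a complete no-go. If the inner loop in `test` were coded in a lower-level language, some performance might be gained, as the `index` built-in wouldn't have to be called. If you can write C, look into one of the modules for embedding C in Perl. If the whole script were written in a lower-level language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).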
