Question

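I have a few functions that I run over a million times on various texts, which means that small improvements in these functions translate into large gains overall. I have noticed that every function involving word counting takes drastically longer to run than everything else, so I would like to try counting words a different way.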

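Basically, my function takes a number of objects that have text associated with them, verifies that the text does not match certain patterns, and then counts the number of words in that text. A basic version of the function is: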

my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
   my $text = $self->_getTextFromNode($i);
   #If it looks like a node full of bogus text, or just a number, remove it.
   if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
   my @text_words = split(/\s+/, $text);
   $num_words += scalar(@text_words);
   if ($num_words > 30) { return 30; }
}
return $num_words;
}

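I do plenty of text comparisons like these elsewhere in my code, so I am guessing the problem must be with my word counting. Is there a faster way to do it than splitting on \s+? If so, what is it, and why is it faster (so that I can understand what I am doing wrong and apply that knowledge to similar problems later)?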

Answer 1

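Using a while loop with a regex is the fastest way I have found to count words:

use strict;
use warnings;
use feature 'say';   # needed for say() below

my $text = 'asdf asdf asdf asdf asdf';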

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
    my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
    my $num; 
    $num++ while $text =~ /\S+/g;     #/
    $num
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
}

#          Rate  list array while
# list  303674/s    --  -22%  -55%
# array 389212/s   28%    --  -42%
# while 675295/s  122%   74%    --

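The while loop is faster because no memory has to be allocated for each word found. The regex is also evaluated in boolean context, which means it does not need to extract the actual matches from the string.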
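
As a minimal sketch of that point, the two regex approaches can be written as subs that take the text as an argument. The names count_words_while and count_words_list are hypothetical and not part of the benchmark above. Both return the same count; the list version just has to produce every match before counting them.

```perl
use strict;
use warnings;

# Hypothetical helper: count words with a scalar-context //g match.
# Each iteration only advances pos($text); no list of matches is built.
sub count_words_while {
    my ($text) = @_;
    my $num = 0;                      # start at 0 so empty input yields 0
    $num++ while $text =~ /\S+/g;
    return $num;
}

# Hypothetical helper: same count, but all matches are first produced
# as a list and then counted by assigning through the empty list.
sub count_words_list {
    my ($text) = @_;
    my $num = () = $text =~ /\S+/g;
    return $num;
}

print count_words_while("  asdf asdf\tasdf\n"), "\n";   # 3
print count_words_list("  asdf asdf\tasdf\n"), "\n";    # 3
```

Starting the counter at 0 rather than leaving it undefined means an empty string yields 0 instead of undef.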

Answer 2

If the words are separated only by single spaces, counting the spaces is fast.

sub count1
{
    my $str = shift;
    return 1 + ($str =~ tr{ }{ });
}
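
The single-space assumption matters. A small sketch, comparing count1 with a hypothetical regex-based helper (count_nonspace_runs, not from the answer) on made-up sample strings, shows where the two disagree:

```perl
use strict;
use warnings;

# count1 as given above: one more than the number of space characters.
sub count1 {
    my $str = shift;
    return 1 + ($str =~ tr{ }{ });
}

# Hypothetical comparison helper: count runs of non-whitespace instead.
sub count_nonspace_runs {
    my $str = shift;
    my $num = 0;
    $num++ while $str =~ /\S+/g;
    return $num;
}

# Sample inputs: single spaces, a double space, a leading space, empty.
for my $s ('asdf asdf asdf', 'asdf  asdf asdf', ' asdf asdf asdf', '') {
    printf "%-20s tr=%d regex=%d\n", "'$s'", count1($s), count_nonspace_runs($s);
}
```

Only the first string gives the same answer from both subs. A doubled or leading space inflates the tr count by one each, and an empty string is reported as one word.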

Updated benchmark:

use strict;
use warnings;
use feature 'say';   # needed for say() below

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
   my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
   my $num; 
   $num++ while $text =~ /\S+/g;     #/
   $num
}

sub count_tr {
    1 + ($text =~ tr{ }{ });
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5
say count_tr; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
    tr    => \&count_tr,
}

#            Rate  list while array    tr
# list   220911/s    --  -24%  -44%  -94%
# while  291225/s   32%    --  -26%  -92%
# array  391769/s   77%   35%    --  -89%
# tr    3720197/s 1584% 1177%  850%    --

Answer 3

Since you are capping the word count at 30, you can return from the function early:

while ($text =~ /\S+/g) {
    ++$num_words == 30 && return $num_words;
}    
return $num_words;

Or with split:

$num_words = () = split /\s+/, $text, 30;
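
Both variants can be wrapped up as standalone subs for comparison. This is only a sketch: the sub names and the $cap parameter are hypothetical, and the split version assumes the text has no leading or trailing whitespace, since either would add an empty field to the count when a LIMIT is given.

```perl
use strict;
use warnings;

# Hypothetical helper: count words in $text, stopping once $cap is reached.
sub count_words_capped {
    my ($text, $cap) = @_;
    my $num_words = 0;
    while ($text =~ /\S+/g) {
        return $num_words if ++$num_words == $cap;
    }
    return $num_words;
}

# Hypothetical helper: split with LIMIT; the last field holds the rest of
# the text, so at most $cap fields are ever produced.
# Assumption: $text has no leading or trailing whitespace.
sub count_words_split_capped {
    my ($text, $cap) = @_;
    my $num_words = () = split /\s+/, $text, $cap;
    return $num_words;
}

my $long = join ' ', ('word') x 100;
print count_words_capped($long, 30), "\n";                 # 30
print count_words_split_capped($long, 30), "\n";           # 30
print count_words_capped('just three words', 30), "\n";    # 3
```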

Answer 4

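For correctness, following on from aleroot's answer, you probably want split " " rather than the original split /\s+/, to avoid a fencepost error. As the split documentation puts it: "A split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field." That difference gives you one extra word (the empty first field, that is) for every line that starts with whitespace.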

For speed, since you are capping the number of words at 30, you may want to use the LIMIT argument: split " ", $str, 30.

On the other hand, the other answers sensibly point you away from split entirely, since you do not need the list of words, only their count.
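
A minimal sketch of the fencepost difference, using a made-up sample line that starts with whitespace:

```perl
use strict;
use warnings;

my $line = "  asdf asdf asdf";       # note the leading whitespace

my @by_regex = split /\s+/, $line;   # ('', 'asdf', 'asdf', 'asdf')
my @by_space = split ' ',   $line;   # ('asdf', 'asdf', 'asdf')

print scalar(@by_regex), "\n";       # 4: the empty first field is counted
print scalar(@by_space), "\n";       # 3
```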

Answer 5

Since you only need the number of words and not an array of words, it would be best to avoid split. Something like this might work:

$num_words += $text =~ s/((^|\s)\S)/$1/g;

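It trades the work of building an array of words for the work of replacing the start of each word with itself. You would need to benchmark it to find out whether it is actually faster.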
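
The line above relies on s///g returning the number of substitutions it made. A small sketch on a made-up sample string (the sub name count_word_starts is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: count word starts via the return value of s///g.
# The substitution rewrites each match with itself, so the text is unchanged.
sub count_word_starts {
    my ($text) = @_;                 # work on a copy of the caller's string
    my $num_words = 0;
    # s///g returns the number of substitutions, or a false value that
    # counts as 0 when nothing matched.
    $num_words += $text =~ s/((^|\s)\S)/$1/g;
    return $num_words;
}

print count_word_starts(" asdf asdf\tasdf "), "\n";   # 3
print count_word_starts(""), "\n";                    # 0
```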

Edit: this might be faster:

++$num_words while $text =~ /\S+/g;

What's the fastest way to count the number of words in a string in Perl?
