Question

I'm looking for some details on how the Perl grep function works. I'm doing this:

if ( grep { $foo == $_ } @bar ) {
  some_code();
}

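Assume @bar is large (hundreds of thousands of elements). With my data, if I sort @bar, values of $foo are more likely to appear near the beginning of the array than near the end. I'm wondering whether this would help performance.

In other words, with the code above, does grep move through @bar sequentially, checking whether $foo == $_, and then exit as soon as a match is found? Or does it actually check every element of @bar before returning a value?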

Answer 1

grep does not short-circuit, so the ordering of the elements doesn't matter.

While List::Util's first (or List::MoreUtils's firstval) does short-circuit, the whole list must be placed on the stack before it is called.
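To make the difference concrete, here is a minimal sketch that counts how many times each block runs. The sample data and the counter names ($grep_calls, $first_calls) are made up for illustration; first is imported from the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(first);

my @bar = (1 .. 10);   # sample data, for illustration only
my $foo = 3;

# Count how many times each block is evaluated.
my ($grep_calls, $first_calls) = (0, 0);

# grep runs its block for every element, even after a match.
my $grep_hits = grep { $grep_calls++; $foo == $_ } @bar;

# first stops at the first element for which the block is true.
my $found = first { $first_calls++; $foo == $_ } @bar;

print "grep evaluated its block $grep_calls times\n";    # 10
print "first evaluated its block $first_calls times\n";  # 3
```

Even though first stops early, all of @bar is still pushed onto the argument stack before the call, which is the cost this answer points out.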

This would be best:

for (@bar) {
   if ($foo == $_) {
      some_code();
      last;   # stop scanning at the first match
   }
}

Update: I originally iterated over the indexes, since that uses O(1) memory, but so does for (@bar) (as opposed to for (LIST) in general), as ysth reminded me.

Answer 2

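Since your grep is used in scalar context, it returns the number of matching elements. To compute that count, Perl has to visit every element anyway, so from this perspective sorting is unlikely to help performance.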
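A small sketch of the two contexts, using made-up sample data. The condition of an if is also evaluated in scalar context, so the question's code receives this count and tests whether it is non-zero.

```perl
use strict;
use warnings;

my @bar = (5, 3, 5, 8);   # sample data, for illustration only
my $foo = 5;

my $count   = grep { $foo == $_ } @bar;   # scalar context: number of matches
my @matches = grep { $foo == $_ } @bar;   # list context: the matching elements

print "count = $count\n";        # count = 2
print "matches = @matches\n";    # matches = 5 5
```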

Answer 3

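In your example, grep will iterate over the entire array, no matter how many elements match.

If you are able to keep this array sorted, searching for your value with a binary search will be faster. You could also convert the array to a hash (with the array elements as keys) and check for membership in constant time, but that will use additional memory.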
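Both alternatives can be sketched as follows. The helper name bsearch_contains and the sample data are hypothetical, and the hash variant assumes the values are plain integers, since hash keys are compared as strings.

```perl
use strict;
use warnings;

# Hypothetical helper: binary search over a numerically sorted array ref.
sub bsearch_contains {
    my ($sorted, $item) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($sorted->[$mid] < $item) { $lo = $mid + 1 }
        elsif ($sorted->[$mid] > $item) { $hi = $mid - 1 }
        else                            { return 1 }
    }
    return 0;
}

my @bar    = (42, 7, 19, 3, 88);   # sample data, for illustration only
my @sorted = sort { $a <=> $b } @bar;

# Hash variant: the array elements become keys (costs extra memory).
my %seen;
@seen{@bar} = ();

print bsearch_contains(\@sorted, 19) ? "found\n" : "missing\n";
print exists $seen{20}               ? "found\n" : "missing\n";
```

The binary search needs O(log n) comparisons per lookup but requires the array to stay sorted. The hash is built once in O(n) and then answers each lookup in roughly constant time, which tends to pay off when many values are checked against the same array.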

Answer 4

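Regarding your question:

"With my data, if I sort @bar, values of $foo are more likely to appear near the beginning of the array than near the end. I'm wondering whether this would help performance."

If the list is sorted in ascending numerical order, then you can write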

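sub contains {
  my ($list, $item) = @_;
  for (@$list) {
    # The list is ascending, so the first element >= $item decides the answer.
    return $_ == $item if $_ >= $item;
  }
  return !1;   # reached the end without a match: return false
}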

some_code() if contains(\@bar, $foo);
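As a rough illustration of why this stops early on sorted data, here is a sketch of the same loop with a counter added. The names contains_counted and $examined, and the sample list, are hypothetical additions.

```perl
use strict;
use warnings;

my $examined = 0;   # hypothetical counter, for illustration only

sub contains_counted {
    my ($list, $item) = @_;
    for (@$list) {
        $examined++;
        # In ascending order, the first element >= $item decides the answer.
        return $_ == $item if $_ >= $item;
    }
    return !1;
}

my @bar = (2, 4, 6, 8, 10, 12);   # already sorted ascending

my $hit = contains_counted(\@bar, 6);
print "6 found after $examined comparisons\n";      # 3

$examined = 0;
my $miss = contains_counted(\@bar, 5);
print "5 ruled out after $examined comparisons\n";  # 3
```

A missing value is rejected as soon as a larger element is seen, so the rest of the list is never scanned.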

Answer 5

It depends. grep { $x == $_ } @a does not benefit from branch prediction, but grep { $x < $_ } @a does!

#!/usr/bin/env perl

use strict;
use warnings;

use Time::HiRes qw( clock_gettime CLOCK_MONOTONIC );

use constant MIN => 0;
use constant MAX => 1000;
use constant AVG => int(MIN  + (MAX - MIN) / 2);
use constant N_LOOPS => 40000;
use constant ARRAY_LEN => 10000;

## is grep faster for sorted arrays?

##
## RANDOM ARRAY VALUES
##
my $n = 0;
my @a = map { int(rand() * (MAX - MIN) + MIN) } 1 .. ARRAY_LEN;
my $duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/;

##
## PREDICTABLE ARRAY VALUES
##
$n = 0;
@a = sort {$a <=> $b} @a;
$duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/;

## and now we try to eliminate side effects by repeating

##
## RANDOM ARRAY VALUES
##
$n = 0;
@a = map { int(rand() * (MAX - MIN) + MIN) } 1 .. ARRAY_LEN;
$duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}   
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/; 

##
## PREDICTABLE ARRAY VALUES
##
$n = 0;
@a = sort {$a <=> $b} @a;
$duration = -clock_gettime ( CLOCK_MONOTONIC );
for(my $i = 0; $i < N_LOOPS; $i++) {
    $n += grep { AVG < $_ } @a;
}   
$duration += clock_gettime ( CLOCK_MONOTONIC );
print "duration: $duration secs, n = $n".$/;

Results:

duration: 27.7465513650095 secs, n = 199880000 <-- unsorted
duration: 26.129752348992 secs, n = 199880000  <-- sorted
duration: 28.3962040760089 secs, n = 202920000 <-- unsorted
duration: 26.082420132996 secs, n = 202920000  <-- sorted

See also "Why is it faster to process a sorted array than an unsorted array?"

Does sorting help the efficiency of grep in Perl?
