Question

I need to write a program that counts the occurrences of a substring within a string in Perl. I have implemented it as follows:

sub countnmstr
{
  $count =0;
  $count++ while $_[0] =~ /$_[1]/g;
  return $count;
}

$count = countnmstr("aaa","aa");

print "$count\n";

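This is what I would normally do. In the example above, however, I want to count the occurrences of 'aa' in 'aaa'. The answer I get, 1, seems reasonable, but I also need to take overlapping occurrences into account. The case above should therefore give 2, because there are two occurrences of 'aa' once overlaps are counted.

Can anyone suggest how to implement this?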

Answer 1

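Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it; that's just what people call it. It relies on the fact that a list assignment evaluated in scalar context returns the number of elements in the right-hand list.

The Perl idiom for counting matches is then: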

 my $count = () = $_[0] =~ /($pattern)/g;

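The goatse part is the = () =, which is an empty list in the middle of two assignments. The left-hand part of the goatse gets the count from the right-hand side of the goatse. Note that the count is the length of the list the match operator returns in list context: with a capture in the pattern, that is the list of captured strings; without one, it is the list of whole matches.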
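As a minimal sketch of the idiom on its own (the sample string `$text` is an assumption made for illustration), the following counts ordinary, non-overlapping matches and shows why the question's 'aaa' case yields 1:

```perl
use strict;
use warnings;

my $text = 'abcabcabc';    # hypothetical sample input

# In list context //g returns every match. Assigning that list to ()
# and then to a scalar yields the number of elements on the right.
my $count = () = $text =~ /(abc)/g;
print "$count\n";          # 3

# Each match consumes its characters, so overlaps are not seen.
my $plain = () = 'aaa' =~ /(aa)/g;
print "$plain\n";          # 1
```

The second result is the one the question complains about; the lookaround patterns are what make overlapping matches visible.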

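Now, the next trick is that you really want a positive lookbehind (or possibly a lookahead). Lookarounds don't consume characters, so you don't need to keep track of the position: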

 my $count = () = 'aaa' =~ /((?<=a)a)/g;

Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.
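To make the difference concrete, here is a small sketch comparing the lookbehind form with a lookahead form. The variable-width pattern `a+b` and the input 'aab' are assumptions chosen only for illustration:

```perl
use strict;
use warnings;

# Lookbehind form from above: counts each 'a' that is preceded by an 'a'.
my $behind = () = 'aaa' =~ /((?<=a)a)/g;

# Lookahead form: a zero-width match at every position where 'aa' starts.
my $ahead = () = 'aaa' =~ /(?=(aa))/g;

# A variable-width pattern works inside a lookahead.
my $variable = () = 'aab' =~ /(?=(a+b))/g;

print "$behind $ahead $variable\n";    # 2 2 2
```

In the last case the pattern matches 'aab' starting at position 0 and 'ab' starting at position 1, so two overlapping occurrences are counted.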

Answer 2

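See ysth's answer ... I failed to realize that the pattern could consist solely of a zero-width assertion and still work for this purpose.

You can use a positive lookahead, as others have suggested, and write the function as: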

sub countnmstr {
    my ($haystack, $needle) = @_;
    my ($first, $rest) = $needle =~ /^(.)(.*)$/;
    return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}

You can also use pos to adjust where the next search starts:

#!/usr/bin/perl

use strict; use warnings;

sub countnmstr {
    my ($haystack, $needle) = @_;
    my $adj = length($needle) - 1;
    die "Search string cannot be empty!" if $adj < 0;

    my $count = 0;
    while ( $haystack =~ /\Q$needle/g ) {
        pos $haystack -= $adj;
        $count += 1;
    }
    return $count;
}

print countnmstr("aaa","aa"), "\n";

Output:

C:\Temp> t
2

Answer 3

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}

$count = countnmstr("aaa","aa");

print "$count\n";

A few points:

//g in list context matches as many times as possible.

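\Q...\E is used to automatically escape any metacharacters, so that you are doing a substring count, not a subpattern count.

Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.

This uses the same feature as the goatse/flying-lentil/spread-eagle/whatever operator, where a list assignment (in this case, to an empty list) in scalar context returns the number of elements on the right of the list assignment, but it uses scalar() instead of a scalar assignment to provide the scalar context.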

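$_[0] is not used directly, but is instead copied to a lexical; if the passed string had a stored pos(), a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning.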
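The last point can be illustrated with a short sketch. The names `count_direct` and `count_copy` are hypothetical and exist only for this comparison; the first matches against the aliased $_[0], the second against a lexical copy:

```perl
use strict;
use warnings;

# Hypothetical variant: matches against the caller's scalar via $_[0].
sub count_direct {
    return scalar( () = $_[0] =~ /(?=\Q$_[1]\E)/g );
}

# Variant from the answer: copies the arguments to lexicals first.
sub count_copy {
    my ($string, $substr) = @_;
    return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}

my $s = 'aaaa';

$s =~ /a/g;                            # leaves pos($s) at 1
my $direct = count_direct($s, 'aa');   # search starts at offset 1

$s =~ /a/g;                            # set pos($s) to 1 again
my $copied = count_copy($s, 'aa');     # the copy carries no pos()

print "$direct $copied\n";             # 2 3
```

The copy sees all three overlapping occurrences, while the aliased version misses the one that starts at offset 0.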

Update: s///g is faster, though still not as fast as using index:

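sub countnmstr
{
    my ($string, $substr) = @_;
    # s///g returns the number of substitutions made, or an empty
    # string when nothing matched, hence the "|| 0".
    return ( $string =~ s/(?=\Q$substr\E)//g ) || 0;
}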

Answer 4

You could use a lookahead assertion in the regular expression:

sub countnmstr {
    my @matches = $_[0] =~ /(?=($_[1]))/g;

    return scalar @matches;
}

I suspect Sinan's suggestion will be quicker, though.

Answer 5

You can try this; no regex is needed.

$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
    $ind = index($haystack,$needle);
    if ( $ind == -1 ) {last};
    $haystack = substr($haystack,$ind+1);
    $count++;
}
print "Total count: $count\n";

Output:

$ ./perl.pl
Total count: 4

Answer 6

If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.

use strict;
use warnings;

sub countnmstr_regex {
    my ($haystack, $needle) = @_;
    return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}

sub countnmstr_index {
    my ($haystack, $needle) = @_;
    my $i = 0;
    my $tally = 0;
    while (1){
        $i = index($haystack, $needle, $i);
        last if $i == -1;
        $tally ++;
        $i ++;
    }
    return $tally;
}

use Benchmark qw(cmpthese);

my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';

cmpthese( -2, {
    countnmstr_regex => sub { countnmstr_regex($h, $n) },
    countnmstr_index => sub { countnmstr_index($h, $n) },
} );

__END__

# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
                     Rate countnmstr_regex countnmstr_index
countnmstr_regex  93701/s               --             -66%
countnmstr_index 271893/s             190%               --

# Result using a large haystack ($size = 100).
                   Rate countnmstr_regex countnmstr_index
countnmstr_regex  929/s               --             -81%
countnmstr_index 4960/s             434%               --

How can I count overlapping substrings in Perl?

6 answers: