perl:字符串匹配以查找最长的子字符串

时间:2015-11-10 19:24:13

标签: regex perl

$string1 = "peachbananaapplepear";
$string2 = "juicenaapplewatermelonpear";

我想知道包含单词" apple"的最长公共子字符串是什么。

$string2 =~ m/.+apple.+/;
print $string2;

所以我使用匹配运算符和.+来匹配关键字" apple"之前和之后的任何字符。当我打印$string2时,它不会返回naapple,而是会返回原来的$string2

3 个答案:

答案 0 :(得分:1)

这是一种方法。首先得到苹果'的位置。出现在字符串中。对于string1中的每个位置,请查看string2中的所有位置。 向左和向右看,看看共性从初始位置延伸到多远。

$string1 = "peachbananaapplepear12345applegrapeapplebcdefghijk";
$string2 = "juicenaapplewatermelonpearkiwi12345applebcdefghijkberryapple";

my $SearchFor="apple";
my $SearchStrLen = length($SearchFor);

# Get locations in first string where the search term appears
my @FirstPositions =  getPostions($string1);
# Get locations in second string where the search term appears
my @SecondPositions =  getPostions($string2);

CheckForMaxMatch();

sub getPostions
{
  my $GivenString = shift;
  my @Positions;
  my $j=0;
  for (my $i=0; $i < length($GivenString); $i += ($SearchStrLen+1) )
  {
    $j = index($GivenString, $SearchFor, $i);
    if ($j == -1) {
     last;
    }
    push (@Positions, $j);
    $i = $j;
  }

  return @Positions;
}

sub CheckForMaxMatch
{
  my $MaxLeft=0;
  # From the location of 'apple', look to the left and right
  # to see how far the characters are same
  for my $i (@FirstPositions) {
    for my $j (@SecondPositions) {
      my $LeftMatchPos = getMaxMatch($i, $j, -1);
      my $RightMatchPos = getMaxMatch($i, $j, 1);

      if ( ($RightMatchPos - $LeftMatchPos) > ($MaxRight - $MaxLeft) ) {
        $MaxLeft = $LeftMatchPos;
        $MaxRight = $RightMatchPos;
      }
    }
  }

  my $LongestSubString = substr($string1, $MaxLeft, $MaxRight-$MaxLeft);
  print "Longest common substring is: $LongestSubString\n";
  print "It begins at $MaxLeft and ends at $MaxRight in string1\n";
}

sub getMaxMatch
{
  my $i= shift;
  my $j= shift;
  my $direction= shift;

  my $k = ($direction >= 1 ? $SearchStrLen : 0);

  my $FirstChar = substr($string1, $i+($k * $direction), 1);
  my $SecondChar = substr($string2, $j+($k * $direction), 1);

  for ( ; $FirstChar && $SecondChar; $k++ )
  {
    $FirstChar = substr($string1, $i+($k * $direction), 1);
    $SecondChar = substr($string2, $j+($k * $direction), 1);
    if ( $FirstChar ne $SecondChar ) {
      $direction < 1 ? $k-- : "";
      my $pos = ($k ? ($i + $k * $direction) : $i);
      return $pos;
    }
  }

  return $i;
}

答案 1 :(得分:0)

=〜运算符不会重新分配$ string2的值。试试这个:

$string2 =~ m/(.+apple.+)/;
my $match = $1;
print $match

答案 2 :(得分:0)

基于general algorithm,但不仅跟踪当前运行的长度(@l),还跟踪是否包含关键字(@k)。只有包含关键字的运行才会被视为运行时间最长。

use strict;
use warnings;
use feature qw( say );

sub find_substrs {
   our $s;   local *s   = \shift;
   our $key; local *key = \shift;

   my @positions;
   my $position = -1;
   while (1) {
      $position = index($s, $key, $position+1);
      last if $position < 0;

      push @positions, $position;
   }

   return @positions;
}

sub lcsubstr_which_include {
   our $s1;  local *s1  = \shift;
   our $s2;  local *s2  = \shift;
   our $key; local *key = \shift;

   my @key_starts1 = find_substrs($s1, $key)
      or return;

   my @key_starts2 = find_substrs($s2, $key)
      or return;

   my @is_key_start1;  $is_key_start1[$_] = 1 for @key_starts1;
   my @is_key_start2;  $is_key_start2[$_] = 1 for @key_starts2;

   my @s1 = split(//, $s1);
   my @s2 = split(//, $s2);

   my $length = 0;
   my @rv;
   my @l = ( 0 ) x ( @s1 + 1 );  # Last ele is read when $i1==0.
   my @k = ( 0 ) x ( @s1 + 1 );  # Same.
   for my $i2 (0..$#s2) {
      for my $i1 (reverse 0..$#s1) {
         if ($s1[$i1] eq $s2[$i2]) {
            $l[$i1] = $l[$i1-1] + 1;
            $k[$i1] = $k[$i1-1] || ( $is_key_start1[$i1] && $is_key_start2[$i2] );

            if ($k[$i1]) {
               if ($l[$i1] > $length) {
                  $length = $l[$i1];
                  @rv = [ $i1, $i2, $length ];
               }
               elsif ($l[$i1] == $length) {
                  push @rv, [ $i1, $i2, $length ];
               }
            }
         } else {
            $l[$i1] = 0;
            $k[$i1] = 0;
         }
      }
   }

   for (@rv) {
      $_->[0] -= $length;
      $_->[1] -= $length;
   }

   return @rv;
}

{
   my $s1 = "peachbananaapplepear";
   my $s2 = "juicenaapplewatermelonpear";
   my $key = "apple";

   for (lcsubstr_which_include($s1, $s2, $key)) {
      my ($s1_pos, $s2_pos, $length) = @$_;
      say substr($s1, $s1_pos, $length);
   }
}

O(NM)中的这个解决方案,意味着它可以很好地扩展(它的功能)。