简单的字符串压缩和比较

时间:2012-07-18 13:22:21

标签: string perl recursion compression

我正在尝试比较两个字符串,并且作为输出我想要连续相同字符的计数,如果字符不同,则只是第二个字符串中的字符。我有一个有效的递归实现,但我无法弄清楚如何一起添加连续计数

代码:

use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse  = 1;

my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

sub find_diff {
    my ( $a, $b ) = @_;
    my @rtn = ();
    my $len = length $a;
    my $div = $len / 2;
    if ( $div < 1 ) {
        return $b;
    }
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ($a_1 eq $b_1) {
         push @rtn, length $a_1;
    }
    else {
        push @rtn, find_diff( $a_1, $b_1 );
    }
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ($a_2 eq $b_2) {
        push @rtn, length $a_2;
    }
    else {
        push @rtn, find_diff( $a_2, $b_2 );
    }
    return @rtn;
}

print Data::Dumper::Dumper( [ find_diff('xaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff('aaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff( $str1, $str2 ) ] ) . "\n";

输出:

['a',2,1,1,'c']
[3,1,1,'c']
[26,3,1,1,'x','x','x','x','x','x','x',1,1,'x','x',4,7]

期望的输出:

['a',4,'c']
[5,'c']
[31,'x','x','x','x','x','x','x',2,'x','x',11]

当然我可以将字符拆分成一个unpack的数组,然后相当容易地计算连续匹配,但我想尝试一种分而治之的方法,以便我可以比较性能。

谢谢!

编辑 - 管理通过返回嵌套数组然后减少来在递归情况下解决它。令人惊讶的是,这并不慢:

sub find_diff {
    my ( $a, $b ) = @_;
    my @rtn = ();
    my $len = length $a;
    if ( $len < 2 ) {
        return [$b, 0];
    }
    my $div = $len / 2;
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ($a_1 eq $b_1) {
        push @rtn, [length $a_1, 1];
    }
    else {
        push @rtn, find_diff( $a_1, $b_1 );
    }
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ($a_2 eq $b_2) {
        push @rtn, [length $a_2, 1];
    }
    else {
        push @rtn, find_diff( $a_2, $b_2 );
    }
    return @rtn;
}
sub compress_string {
    my ($a, $b) = @_;
    my @list = find_diff($a, $b);
    my $acc = 0;
    my @result = ();
    foreach my $item (@list) {
        if ( $item->[1] ) {
            $acc += $item->[0];
        } else {
            push @result, if $acc;
            push @result, $item->[0];
            $acc = 0;
        }
    }
    push @result, $acc if $acc;
    return @result;
}

结果符合我的要求。

更新 - 效果统计

这真的很有趣。使用unpack( 'C*', $string)非常快,我认为这就是为什么我的迭代版本如此快速。递归的速度优势来自更长的字符串(434个字符)

                         Rate short_recurse_borodin short_recurse short_array_borodin short_array_sodved short_array
short_recurse_borodin  6944/s                    --          -31%                -36%               -73%        -84%
short_recurse         10091/s                   45%            --                 -8%               -61%        -76%
short_array_borodin   10929/s                   57%            8%                  --               -57%        -74%
short_array_sodved    25707/s                  270%          155%                135%                 --        -40%
short_array           42553/s                  513%          322%                289%                66%          --
                      Rate mid_array_borodin mid_recurse_borodin mid_string mid_array_sodved mid_array
mid_array_borodin   1418/s                --                -28%       -56%             -65%      -82%
mid_recurse_borodin 1972/s               39%                  --       -39%             -52%      -76%
mid_recurse         3226/s              127%                 64%         --             -21%      -60%
mid_array_sodved    4082/s              188%                107%        27%               --      -49%
mid_array           8065/s              469%                309%       150%              98%        --
                       Rate long_array_borodin long_array_sodved long_recurse_borodin long_array long_string
long_array_borodin    172/s                 --              -67%                 -80%       -85%        -89%
long_array_sodved     513/s               199%                --                 -40%       -55%        -67%
long_recurse_borodin  854/s               397%               66%                   --       -25%        -45%
long_array           1142/s               564%              122%                  34%         --        -26%
long_recurse         1546/s               800%              201%                  81%        35%          --

3 个答案:

答案 0 :(得分:1)

编辑:哎呀,对不起。刚刚看到你评论想要使用递归和拆分字符串。所以我的回答并不合适,抱歉。无论如何,我会离开它。

我认为你不需要递归。以下作品

use Data::Dumper;

sub find_diff($$)
{
    my( $a, $b ) = @_;
    my @res;
    my @a = split( '', $a );
    my @b = split( '', $b );
    # Assume a and b are the same length
    my $mcount = 0;
    for( my $i = 0; $i < scalar(@a); $i++ )
    {
        if( $a[$i] eq $b[$i] )
        {
            $mcount++;
        }
        else
        {
            if( $mcount )
            {
                push( @res, $mcount );
            }
            $mcount = 0;
            push( @res, $b[$i] );
        }
    }
    if( $mcount )
    {
        push( @res, $mcount );
    }
    return @res;
} # END find_diff

print Data::Dumper::Dumper( [ find_diff('xaabbb', 'aaabbc' ) ] ) . "\n";
print Data::Dumper::Dumper( [ find_diff('aaabbb', 'aaabbc' ) ] ) . "\n";

答案 1 :(得分:1)

尽管我有所保留,但我已经更新了我的解决方案以显示递归方法。基准测试取决于您!请发布您的结果。

递归或分而治之的方法不适合此问题。最后,必须比较每对字符并评估连续匹配字符的数量。无论是一次性完成这一切还是将字符串分成两部分,分别处理每一半,并重新组合结果,都没有区别。事实上,由于分割和组合中间结果所需的代码,递归解决方案绑定更慢。

应该通过将两个字符串分成单个字符并比较两个序列中的每对字符来解决此问题。

此解决方案似乎可以执行所需的操作,并且还可以解释两个字符串长度不同的情况。

use strict;
use warnings;

use Data::Dump;

my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

dd [ find_diff( 'xaabbb', 'aaabbc' ) ];
dd [ find_diff( 'aaabbb', 'aaabbc' ) ];
dd [ find_diff( $str1, $str2 ) ];
dd [ find_diff( 'xxx', 'xx' ) ];

sub find_diff {

  my @str1 = unpack '(A1)*', shift;
  my @str2 = unpack '(A1)*', shift;
  my @return;
  my $nmatch;

  while (@str1 or @str2) {
    my @pair = map $_ // '', ( shift(@str1), shift(@str2) );
    if ($pair[0] eq $pair[1]) {
      $nmatch++;
    }
    else {
      push @return, $nmatch if $nmatch;
      undef $nmatch;
      push @return, $pair[1];
    }
  }
  push @return, $nmatch if $nmatch;

  return @return;
}

<强>输出

["a", 4, "c"]
[5, "c"]
[31, "x", "x", "x", "x", "x", "x", "x", 2, "x", "x", 11]
[2, ""]

<强>更新

为满足您对类似递归解决方案的请求,此子例程使用递归方法执行相同操作。它产生相同的,除非它提供了一对比较具有不同长度的字符串。

请注意,它依赖于原始字符串中的数据完全是非数字的。如果情况并非如此,则问题会变得更加复杂。

更新2

我修改了recursive_find_diff来正确处理包含数字字符的字符串。它依赖于结果列表的成员都是单个字符,除非它们是匹配字符的计数。所以我在所有匹配计数之前添加了+,以使它们始终比一个字符长,并且很容易区分。

确定所有这些复杂功能将比简单的解决方案慢得多!

use strict;
use warnings;

use Data::Dump;

my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

dd [ recursive_find_diff( 'xaabbb', 'aaabbc' ) ];
dd [ recursive_find_diff( 'aaabbb', 'aaabbc' ) ];
dd [ recursive_find_diff( $str1, $str2 ) ];
dd [ recursive_find_diff( '111222444888', '11122233488x' ) ];

sub recursive_find_diff {

  my ($str1, $str2) = @_;
  my $len = length $str1;

  die "Strings for comparison must be of equal lengths" unless length $str2 == $len;

  if ($str1 eq $str2) {
    return ( '+'.$len );
  }
  elsif ($len == 1) {
    return $str1 eq $str2 ? ( '+1' ) : ( $str2 );
  }
  else {
    my $half = int($len / 2);
    my @part1 = recursive_find_diff(substr($str1, 0, $half), substr($str2, 0, $half));
    my @part2 = recursive_find_diff(substr($str1, $half), substr($str2, $half));
    if (length $part1[-1] >1 and length $part2[0] > 1) {
      $part2[0] = '+'.($part2[0] + pop @part1);
    }
    return ( @part1, @part2 );
  }
}

<强>输出

["a", "+4", "c"]
["+5", "c"]
["+31", "x", "x", "x", "x", "x", "x", "x", "+2", "x", "x", "+11"]
["+6", 3, 3, "+3", "x"]

答案 2 :(得分:1)

感谢Borodin和Sodved我已经改进了我的解决方案,以至于它非常快。由于我比较的字符串是除了更改值之外几乎相同的日志消息,因此使用递归解决方案会消除大量工作。

正如Sodved所提到的,C中没有类似的收益,因为我仍然需要进行逐字符比较。

它现在做的是检查字符串的长度是否低于某个阈值,如果是,则返回数组比较。

性能如下:

                        Rate          long_recurse long_recurse_fallback
long_recurse          1613/s                    --                  -18%
long_recurse_fallback 1961/s                   22%                    --

这是我的最终代码(删除了测试字符串,它们是真正的日志消息):

use strict;
use warnings;
use Data::Dumper;
use Benchmark qw(cmpthese);
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse  = 1;

my $str1 = "aaaaaaaaaaaabbbbbbbbbbbccccccccdddddddddddeeeefffffff";
my $str2 = "aaaaaaaaaaaabbbbbbbbbbbccccccccxxxxxxxddxxeeeefffffff";

sub find_diff {
    my ( $a, $b, $minlen ) = @_;
    my $len = length $a;
    if ($len < $minlen) {
        return compress_unpack_ary( $a, $b );
    }
    if ( $len < 2 ) {
        return [ord($b), 0];
    }
    my @rtn = ();
    my $div = $len / 2;
    my $a_1 = substr $a, 0, $div;
    my $b_1 = substr $b, 0, $div;
    if ($a_1 eq $b_1) {
        push @rtn, [length $a_1, 1];
    }
    else {
        push @rtn, find_diff( $a_1, $b_1, $minlen );
    }
    my $a_2 = substr $a, $div;
    my $b_2 = substr $b, $div;
    if ($a_2 eq $b_2) {
        push @rtn, [length $a_2, 1];
    }
    else {
        push @rtn, find_diff( $a_2, $b_2, $minlen );
    }
    return @rtn;
}

sub compress_string {
    my ($a, $b, $minlen) = @_;
    my @list = find_diff($a, $b, $minlen);
    my $acc = 0;
    my @result = ();
    foreach my $item (@list) {
        if ( $item->[1] ) {
            $acc += $item->[0];
        } else {
            while ( $acc > 127 ) {
                push @result, 255;
                $acc -= 127;
            }
            push @result, $acc + 128 if $acc;
            push @result, $item->[0];
            $acc = 0;
        }
    }
    while ( $acc > 127 ) {
        push @result, 255;
        $acc -= 127;
    }
    push @result, $acc + 128 if $acc;
    return pack('C*', @result);
}
sub compress_unpack_ary {
    my ( $a, $b ) = @_;
    my @orig       = unpack('C*', $a);
    my @new        = unpack('C*', $b);
    my @nonmatches = ();
    my $count      = 0;
    my $repeats    = 0;
    while ( $count < scalar @new ) {
        if ( $orig[$count] and $new[$count] == $orig[$count] ) {
            $repeats++;
        }
        elsif ( $repeats == 1 ) {
            push @nonmatches, [ $new[$count - 1], 0], [$new[$count], 0];
            $repeats = 0;
        }
        elsif ( $repeats > 1 ) {
            push @nonmatches, [$repeats, 1];
            $repeats = 0;    # reset counter
            push @nonmatches, [$new[$count], 0];
        }
        else {
            push @nonmatches, [$new[$count], 0];
        }
        $count++;
    }
    if ( $repeats > 0 ) {
        push @nonmatches, [$repeats, 1];
    }
    return @nonmatches;
}
print Data::Dumper::Dumper( [ compress_string( $str1, $str2, 20 ) ] ) . "\n";
print Data::Dumper::Dumper( [ compress_string( $str1, $str2, 0 ) ] ) . "\n";
print Data::Dumper::Dumper( [ compress_string( $long_a, $long_b, 20 ) ] ) . "\n";
print Data::Dumper::Dumper( [ compress_string( $long_a, $long_b, 0 ) ] ) . "\n";

cmpthese(1000, {
        'long_recurse' => sub { compress_string($long_a, $long_b, 0 ) },
        'long_recurse_fallback' => sub { compress_string($long_a, $long_b, 20 ) },
        });