在perl中增加字符串的最快方法是什么?

时间:2017-02-19 20:25:34

标签: string perl

我想以快速的方式在循环中在perl中追加一个字符串,而不必为每次迭代复制字符串。我正在寻找像Java或C#的StringBuilder。

我目前知道以下备选方案,以便做'a + = b'。

  1. a。= b#concat
  2. a = join('',a,b); #join
  3. 推@a,b#array push
  4. 我对将所有字符串复制到另一个字符串不感兴趣。我需要每次复制一个字符,或者在迭代时附加小字符串。我试图解决以下问题:将输入字符串'aaabbccc'压缩为'3a2b3c'。所以想法是迭代输入字符串,检查我们有多少重复字符,然后以压缩方式附加到输出。在Perl中执行此操作最有效的是什么?

    Here is a link我试图解决的问题。我虽然略有不同。

2 个答案:

答案 0 :(得分:4)

为了比较,我尝试测试不同版本以解决压缩字符串的实际问题。这是我的测试脚本test.pl

use strict;
use warnings;

use Benchmark qw(cmpthese);
use Inline C => './compress_c.c';

my $str_len = 10000;
my @chars = qw(a b c d);
my $str;
$str .= [@chars]->[rand 4] for 1 .. $str_len;

cmpthese(
    -1,
    {
        compress_array => sub { compress_array( $str ) },
        compress_regex => sub { compress_regex( $str ) },
        compress_str   => sub { compress_str( $str ) },
        compress_c     => sub { compress_c( $str ) },
    }
);

# Suggested by @melpomene in the comments   
sub compress_regex {
    return $_[0] =~ s/([a-z])\1+/($+[0] - $-[0]) . $1/egr;
}

sub compress_array {
    my $result = '';

    my @chrs = split //, $_[0];

    my $prev = $chrs[0];
    my $count = 1;
    my @result;
    for my $i ( 1..$#chrs ) {
        my $char = $chrs[$i];
        if ( $prev eq $char ) {
            $count++;
            next if $i < $#chrs;
        }
        if ( $count > 1) {
            push @result, $count, $prev;
        }
        else {
            push @result, $prev;
        }
        if ( ( $i == $#chrs ) and ( $prev ne $char ) ) {
            push @result, $char;
            last;
        }
        $count = 1;
        $prev = $char;
    }

    return join '', @result;
}

sub compress_str {
    my $result = '';
    my $prev = substr $_[0], 0, 1;
    my $count = 1;
    my $lastind = (length $_[0]) - 1;
    for my $i (1 .. $lastind) {
        my $char = substr $_[0], $i, 1;
        if ( $prev eq $char ) {
            $count++;
            next if $i < $lastind;
        }

        if ( $count > 1) {
            $result .= $count;
        }
        $result .= $prev;
        if ( ( $i == $lastind ) and ( $prev ne $char ) ) {
            $result .= $char;
            last;
        }
        $count = 1;
        $prev = $char;
    }

    return $result;
}

其中compress_c.c是:

SV *compress_c(SV* str_sv) {
    STRLEN len;
    char* str = SvPVbyte(str_sv, len);

    SV* result = newSV(len);
    char *buf = SvPVX(result);

    char prev = str[0];
    int count = 1;
    int j = 0;
    int i;
    for (i = 1; i < len; i++ )
    {
    char cur = str[i];
        if ( prev == cur ) {
            count++;
            if ( i < (len - 1) )
                continue;
        }

        if ( count > 1) {
            buf[j++] = count + '0';  // assume count is less than 10
        }

        buf[j++] = prev;
        if ( (i == (len - 1)) && (prev != cur) ) buf[j++] = cur;
        count = 1;
        prev = cur;
    }

    buf[j] = '\0';
    SvPOK_on(result);
    SvCUR_set(result, j);
    return result;
}

运行perl test.pl

的结果
                  Rate compress_array  compress_str compress_regex    compress_c
compress_array   311/s             --          -42%           -45%          -99%
compress_str     533/s            71%            --            -6%          -98%
compress_regex   570/s            83%            7%             --          -98%
compress_c     30632/s          9746%         5644%          5273%            --

这表明正则表达式版本比字符串版本略快。但是,C版本速度最快,速度是正则表达式版本的50倍。

注意:我在我的Ubuntu 16.10笔记本电脑(英特尔酷睿i7-7500U CPU @ 2.70GHz)上进行了测试

答案 1 :(得分:2)

我已经通过多种方式执行了以下基准测试:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 10000;

sub frequency_concat {
    my $result = '';

    for my $idx (0 .. length($dna) - 1) {
            $result .= substr($dna, $idx, 1);
    }

    return $result;
 }

 sub frequency_join {
    my $result = '';

    for my $idx (0 .. length($dna) - 1) {
            $result = join '', $result, substr($dna,$idx,1);
    }

    return $result;
}

sub frequency_list_push {
       my @result = ();

       for my $idx (0 .. length($dna) - 1) {
               push @result, substr($dna,$idx,1);
       }

       return join '', @result;
 }

 sub frequency_list_prealloc {
            my @result = (' ' x length($dna));

            for my $idx (0 .. length($dna) - 1) {
                    $result[$idx] = substr($dna,$idx,1);
            }

            return join '', @result;
 }


cmpthese(-1, # Run each for at least 1 second(s)   {
               concat => \&frequency_concat,
               join => \&frequency_join,
               list_push => \&frequency_list_push,
               list_list_prealloc => \&frequency_list_prealloc
       }
   );

以下结果表明concat(a.b)是最快的操作。我不明白为什么,因为这需要制作字符串的几个副本。

                    Rate         join   list_push list_list_prealloc          concat
join               213/s           --        -38%               -41%        -74%
list_push          342/s          60%          --                -5%        -58%
list_list_prealloc 359/s          68%          5%                 --        -56%
concat             822/s         285%        140%               129%          --