Perl: counting occurrences of two-character matches in a string

Time: 2011-11-18 14:57:33

Tags: string perl pattern-matching match bioperl

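Is there a way in Perl (not BioPerl) to find the number of occurrences of each pair of consecutive letters,

that is, the counts of AA, AC, AG, AT, CC, CA, ... in a sequence like this: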

$sequence = 'AACGTACTGACGTACTGGTTGGTACGA'

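PS: We can do this by hand with a regex, e.g. $GC = ($sequence =~ s/GC/GC/g), which returns the number of GC occurrences in the sequence. I need an automated, generic way. Thanks for any recommendations.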

3 answers:

Answer 0 (score: 3)

You had me confused for a while, but I take it you want to count the dinucleotides in a given string.

Code:

my @dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

for my $dinuc (@dinucs) {
    $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}

Output from Data::Dumper:

$VAR1 = {
          "AC" => 5,
          "CC" => "",
          "AG" => "",
          "AA" => 1,
          "CG" => 3,
          "CA" => ""
        };
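The empty strings in the dump appear because `s///g` returns an empty string (false) when it makes no substitution. As a minimal sketch, assuming you would rather get a numeric 0 and leave `$sequence` untouched, the common list-assignment counting idiom can be used instead. The variable names follow the answer above; the `print` loop is only illustrative.

```perl
use strict;
use warnings;

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs   = qw(AA AC AG CC CA CG);
my %count;

for my $dinuc (@dinucs) {
    # A global match in list context returns every match; assigning that
    # list to () in scalar context yields the number of matches, which is
    # 0 (not "") when the pattern does not occur.
    $count{$dinuc} = () = $sequence =~ /\Q$dinuc\E/g;
}

print "$_ => $count{$_}\n" for @dinucs;
```

Like the substitution version, this counts non-overlapping matches only.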

Answer 1 (score: 3)

Close to TLP's answer, but without the substitution:

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;

for my $dinuc (@dinucs) {
    while($sequence=~/$dinuc/g) {
        $count{$dinuc}++;
    }
}

Benchmark:

use Benchmark qw(cmpthese);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;

my $count = -3;
my $r = cmpthese($count, {
        'match' => sub {
            for my $dinuc (@dinucs) {
               while($sequence=~/$dinuc/g) {
                    $count{$dinuc}++;
               }
            }
        },
        'substitute' => sub {
            for my $dinuc (@dinucs) {
                $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
            }
         }
});

Output:

              Rate substitute      match
substitute 13897/s         --       -11%
match      15622/s        12%         --

Answer 2 (score: 0)

Regular expressions work if you are careful, but a simple solution using substr will be faster and more flexible.

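(As of this posting, the regex solution marked as accepted does not correctly count dinucleotides in repeat regions such as 'AAAA...', of which there are many in natural sequences. Once 'AA' has been matched, the regex search resumes at the third character, skipping the 'AA' dinucleotide in the middle. This does not affect the other dinucleotides, because if you have 'AC' at one position you cannot have it again at the next base. The particular sequence given in the question does not run into this problem, because no base appears three times in a row.)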
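The overlap problem can be seen on a four-base toy input. This is a small sketch, not part of the original answer; the string 'AAAA' and the variable names are chosen only for illustration. A plain global match consumes both characters of each hit, while a zero-width look-ahead tests every start position.

```perl
use strict;
use warnings;

my $seq = 'AAAA';

# Plain global match: each hit consumes two characters, so the 'AA'
# starting at the second base is skipped.
my $plain = () = $seq =~ /AA/g;

# Zero-width look-ahead: nothing is consumed, so the regex engine tries
# every position and overlapping occurrences are all counted.
my $overlap = () = $seq =~ /(?=AA)/g;

print "non-overlapping: $plain\n";    # 2
print "overlapping:     $overlap\n";  # 3
```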

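The approach I suggest is more flexible in that it can count words of any length; extending the regex approach to longer words is complicated, because you have to do even more gymnastics with the regex to get an accurate count.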

sub substrWise {
    my ($seq, $wordLength) = @_;

    my $cnt = {};

    my $w;
    for my $i (0 .. length($seq) - $wordLength) {
        $w = substr($seq, $i, $wordLength);
        $cnt->{$w}++;
    }

    return $cnt;
}

sub regexWise {
    my ($seq, $dinucs) = @_;

    my $cnt = {};
    for my $d (@$dinucs) {
        if (substr($d, 0,1) eq substr($d, 1,1) ) {
            my $n = substr($d, 0,1);
            $cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g); # use look-ahead
        } else {
            $cnt->{$d} = ($seq =~ s/$d/$d/g);
        }
    }

    return $cnt;
}


my @dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

use Test::More tests => 1;
my $rWise = regexWise($sequence, \@dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for @dinucs; # substrWise will not create keys for words not found
# this seems like desirable behavior IMO, 
# but i'm adding '' to show that the counts match
is_deeply($rWise, $sWise, 'verify equivalence');

use Benchmark qw(:all);
cmpthese(100000, {
    'regex' => sub {
        regexWise($sequence, \@dinucs);
    },
    'substr' => sub {
        substrWise($sequence, 2);
    }
});

Output:

1..1
ok 1 - verify equivalence
          Rate  regex substr
regex  11834/s     --   -85%
substr 76923/s   550%     --

For longer sequences (10-100 kbase) the advantage is less pronounced, but it still wins by about 70%.
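The question also asked for the counts in a fixed order (AA, AC, AG, AT, ...) without writing the list by hand. A possible sketch follows, assuming a DNA alphabet of A, C, G, T. The helpers `all_words` and `count_words` are hypothetical names introduced here; `count_words` uses the same sliding-window idea as `substrWise`.

```perl
use strict;
use warnings;

# Hypothetical helper: every word of length $k over @alphabet, generated
# in the order of @alphabet (AA, AC, AG, AT, CA, ... for k = 2).
sub all_words {
    my ($k, @alphabet) = @_;
    my @words = ('');
    for (1 .. $k) {
        my @next;
        for my $prefix (@words) {
            push @next, $prefix . $_ for @alphabet;
        }
        @words = @next;
    }
    return @words;
}

# Sliding-window count of all words of length $k, overlaps included.
sub count_words {
    my ($seq, $k) = @_;
    my %cnt;
    $cnt{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
    return \%cnt;
}

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my $cnt      = count_words($sequence, 2);

# Words that never occur have no key, so fall back to 0 when printing.
for my $word (all_words(2, qw(A C G T))) {
    printf "%s %d\n", $word, $cnt->{$word} // 0;
}
```

Changing the 2 to another word length in both calls should give trinucleotide or longer counts with no other changes.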