Counting tandem repeats in Perl

Time: 2019-06-07 08:42:21

Tags: perl count substring

I am trying to write code that, given this string:

"TTGCATCCCTAAAGGGATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCTTTGTGATCAA"

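finds the consecutive repeats (also known as tandem repeats) of the substring ATC, counts them, and prints the message "Off" if there are more than 10.

Here is my code: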

my @count = ($content =~ /ATC+/g);
print @count . " Repeat length\n";

$nrRepeats = scalar(@count);    
if ($nrRepeats>10) {
    print("Off\n");
}
else {
    print("On\n");
}

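The problem:
It counts every ATC substring in the string, not just the ATCs that are repeated in tandem.

Any help is much appreciated!

4 answers:

Answer 0 (score: 4)

Your question is a bit ambiguous. I will answer each interpretation separately.

  1. If you want to determine whether the string contains more than 10 consecutive ATCs, you can use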

    if ($content =~ /ATCATCATCATCATCATCATCATCATCATCATC/)
    

    This regular expression can be written more compactly as

    if ($content =~ /(?:ATC){11}/)
    
  2. If you want to count the number of runs of at least 2 consecutive ATCs, you can use

    my $count = () = $content =~ /(?:ATC){2,}/g;
    if ($count > 10)
    

    (See perldoc -q count.)
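
To make the difference between the two interpretations concrete, here is a minimal sketch that runs both checks on the same string. The input is a made-up example (one isolated ATC, a run of 12, another isolated ATC), not the string from the question, and the variable names $has_long_run and $run_count are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical input: one isolated ATC, a run of 12 ATCs, another isolated ATC.
my $content = "TTGCATCCC" . ("ATC" x 12) . "TTTGTGATCAA";

# Interpretation 1: is there a run of more than 10 consecutive ATCs?
my $has_long_run = $content =~ /(?:ATC){11}/ ? 1 : 0;

# Interpretation 2: how many separate runs of at least 2 ATCs are there?
# The empty-list assignment forces list context, so $run_count gets the match count.
my $run_count = () = $content =~ /(?:ATC){2,}/g;

print $has_long_run ? "Off\n" : "On\n";
print "$run_count run(s) of two or more ATC\n";
```

With this input the first check succeeds because of the run of 12, while the second reports a single run, since the two isolated ATCs are not tandem repeats.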

Answer 1 (score: 1)

Your regex /ATC+/g looks for AT followed by one or more C. I suspect what you want is this:

/(ATC(?:ATC)+)/g

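which is ATC followed by one or more further ATCs.

Answer 2 (score: 1)

Perl is aware of repetition and was designed to take repetitive manual work off your hands. You can therefore build the repeated pattern as $pattern x $repetitions, or type 'ATC' x 11 directly.

Besides matching with /(?:ATC){11}/ (as already suggested), here is another way to get Off: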
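
A small sketch of how the two patterns behave side by side may help. The input string is invented for illustration: it contains "ATCCC" (AT followed by several Cs) and then a run of three ATCs.

```perl
use strict;
use warnings;

# Hypothetical input: "ATCCC", then a tandem run of three ATCs.
my $content = "GATCCCTT" . ("ATC" x 3) . "GG";

# /ATC+/ is "AT followed by one or more C": it matches "ATCCC"
# and then each ATC of the run as a separate match.
my @loose = $content =~ /ATC+/g;

# /(ATC(?:ATC)+)/ is "ATC followed by at least one more ATC":
# it matches only a real tandem run, and matches it as a whole.
my @tandem = $content =~ /(ATC(?:ATC)+)/g;

print "loose:  @loose\n";
print "tandem: @tandem\n";
```

Here the + in the first pattern applies only to the C, which is why the quantifier has to be attached to a group to repeat the whole ATC unit.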

print "Off\n" if $content =~ ("ATC" x 11);

To match every tandem repeat of ATC, and to print Off whenever a run has more than 10 repeats, you have to loop [1]:

while ($content =~ /(ATC(?:ATC)+)/g) {
    my $count = (length $1) / 3;
    print "$count repeat length\n";
    print "Off\n" if $count > 10;
}

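Otherwise, for input such as $prefix.ATCx2.$infix.ATCx11.$postfix, detection would stop at the first tandem repeat. The predefined variable $1, which refers to the captured match, is used to check the length of the match.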
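
As a rough illustration of that case, the sketch below builds such an input (a run of 2, then a run of 11) and collects the length of every run. The filler strings and the names @lengths and $long are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical input: a short run (2 repeats), filler, then a long run (11 repeats).
my $content = "GG" . ("ATC" x 2) . "TTT" . ("ATC" x 11) . "AA";

my @lengths;
while ($content =~ /(ATC(?:ATC)+)/g) {
    # Each match is one whole run, so its length divided by 3 is the repeat count.
    push @lengths, length($1) / 3;
}

# grep in scalar context returns how many runs exceed 10 repeats.
my $long = grep { $_ > 10 } @lengths;

print "run lengths: @lengths\n";
print $long ? "Off\n" : "On\n";
```

Because of the /g modifier in the while condition, matching resumes after the previous run, so the long second run is still seen even though the first run is too short.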


[1] The following counts the total number of ATC occurrences, ignoring whether they are consecutive:

my $count = () = $content =~ /ATC/g;
print "count (total matches) $count\n";

Answer 3 (score: 0)

#!/usr/bin/env perl
use strict;
use warnings;
# The string with the text to match
my $content = "TTGCATCCCTAAAGGGATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCTTTGTGATCAA";
# Split the text at every position that is preceded or followed by ATC
my @array = split /(?:(?<=ATC)|(?=ATC))/, $content;
# Create an array whose first element is 0; each element holds the number of consecutive ATC matches in one run
my @count = 0;
for (@array) {
    if (/^ATC$/) {
        # If $_ is exactly ATC, increment the counter of the current run
        $count[-1]++;
    } else {
        # Otherwise, if a run of ATCs was being counted,
        # start a new counter by adding a new element
        $count[-1] != 0 and push @count, 0;
    }
}
# Initialize $max and $index to 0 and undef respectively
my ($max,$index) = (0, undef);
for (keys @count) {
    # If $max is smaller than the current run length,
    # update $max to that value and $index to its position
    $max < $count[$_] and ($max, $index) = ($count[$_], $_);
}
# $index stays undefined if the text contains no ATC at all
defined $index and print "$max Repeat length\n";
# Print Off if the longest run is greater than or equal to 10
print(($max>=10?'Off':'On')."\n");

I think this is a good approach because it gives you more data, such as the number of repeats.

Edit: updated the comments.
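
For comparison, the longest run can also be found with a single global match plus max from the core List::Util module. This is only a sketch of an alternative, not the method used in the script above: the input is a made-up string with runs of 1, 12 and 1 repeats, and the threshold follows the question's "more than 10" rather than the >= 10 used above.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical input with runs of 1, 12 and 1 repeats.
my $content = "TTGCATCCC" . ("ATC" x 12) . "TTTGTGATCAA";

# Capture every maximal run of ATC (single ones included)
# and convert each run to its repeat count.
my @runs = map { length($_) / 3 } $content =~ /((?:ATC)+)/g;

# max() returns undef for an empty list, so fall back to 0 when there is no ATC.
my $max = max(@runs) // 0;

# Assumption: "Off" means strictly more than 10 repeats, as in the question.
my $state = $max > 10 ? "Off" : "On";

print "$max Repeat length\n";
print "$state\n";
```

The @runs array keeps every run length, so the number of runs and their individual sizes remain available as well.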