在DNA序列中寻找反向重复序列

时间:2014-12-01 04:31:54

标签: regex perl bioinformatics dna-sequence

我有一长串DNA序列,我需要找到由间隔序列侧翼的两个回文序列组成的区域。

输入是:

  

cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcga的 tcgatcgatgctagcggcgcgatcga tgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa

这是我的代码:

use strict;
use warnings;

my $input= $ARGV[0];
chomp $input;
open (my $fh_in, "<", $input) or die "Cannot open file $input $!";
my $dna= <$fh_in>;
chomp $dna;

#######################################################################################

if ($dna=~ /[^ACGT]/gi) {
        print "This is not a valid DNA sequence, it has unknown base(s)\n";
}

$dna=~ tr/[acgt]/[ACGT]/;


######################################################################################

print "Minimum length of palindromic sequence?\n";
my $min= <STDIN>;
chomp $min;

print "Maximum length of palindromic sequence?\n";
my $max= <STDIN>;
chomp $max;

print "Minimum length of spacer region?\n";
my $min_spacer= <STDIN>;
chomp $min_spacer;

print "Maximum length of spacer region?\n";
my $max_spacer= <STDIN>;
chomp $max_spacer;

###################################################################################### 

my $dna_length= length($dna);
my ($length , $offset , $string_1 , $string_2);

for ($offset= 0 ; $offset <= $dna_length-$max-$max-$max_spacer ; $offset++) {
        for ($length= $min ; $length <= $max ; $length++) {
                $string_1= substr ($dna, $offset, $length);
                $string_2= reverse $string_1;
                $string_2=~ tr/[ACGT]/[TGCA]/;
                if ($dna=~ /(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/) {
                        print "IR starts at $offset => $2***$3***$4\n$1\n\n";
                }
        }
}

带参数: $ min = 6,$ max = 12,$ min_spacer = 4,$ max_spacer = 12 我得到的输出是:

IR starts at 26 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 27 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 118 => CGGATG***GCCAAGGC***CATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGG***CCAAGG***CCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGGC***CAAG***GCCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 119 => GGATGG***CCAAGG***CCATCC
GGATGGCCAAGGCCATCC

IR starts at 119 => GGATGGC***CAAG***GCCATCC
GGATGGCCAAGGCCATCC

IR starts at 120 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

IR starts at 136 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 164 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 252 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 254 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 254 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 255 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 256 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 258 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 274 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 276 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 305 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 306 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 314 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

然而,当我检查我的第一次点击的区域(在输入中以粗体突出显示)时,此命中的偏移似乎不在位置26.有人能告诉我我的代码有什么问题吗?感谢。

2 个答案:

答案 0 :(得分:1)

你的问题是你的正则表达式正在序列中的任何地方寻找一个回文,而不只是在偏移的位置。 &#34; ATCGATCG&#34;发生在偏移26处,因此匹配。您需要向正则表达式添加一些位置信息。尝试像

这样的东西
/^[ACGT]{$offset}(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/

答案 1 :(得分:1)

这是一个解决方案,它使用实验(??{})功能,据说很长一段时间都会改变,但还没有。

工作原理:它从正则表达式中调用子例程convert,并将第一个匹配组转换为outputstring的所需正则表达式。其余的(回溯等)由正则表达式引擎处理。遗憾的是,将变量插入到分隔长度并不适合正则表达式解析,所以我不得不使用字符串来做到这一点。如果可能的话,请不要这样做。

use warnings;
use strict;
use 5.01;
use re 'eval'; # needed, because of (??{})

my %c=
  (min_pali   =>    (shift) // 6,
   max_pali   =>    (shift) // 12,
   min_spacer =>    (shift) // 4,
   max_spacer =>    (shift) // 12,
  );

my $re1 = "(.{$c{min_pali},$c{max_pali}})(.{$c{min_spacer},$c{max_spacer}})(??{convert})";
while(<DATA>){
  chomp;
  $_ = uc $_;
  my $converted;
  sub convert {
    my $var = reverse $1;
    $var =~ tr{ACGT}{TGCA};
    $converted =  $var;
  }
  while (/$re1/g) {
    printf "%3d => %s**%s**%s\n", $-[0],$1,$2,$converted;
    pos = $-[0] + 1; # start next match one character after the last match start
  }
}

__DATA__
cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa

输出:

118 => CGGATG**GCCAAGGC**CATCCG
119 => GGATGG**CCAAGG**CCATCC
120 => GATGGC**CAAG**GCCATC
254 => ATCGATCG**ATGCTAGCGGCG**CGATCGAT
255 => TCGATCG**ATGCTAGCGGCG**CGATCGA
256 => CGATCG**ATGCTAGCGGCG**CGATCG
258 => ATCGAT**GCTAGCGGCGCG**ATCGAT
314 => GATGGC**CAAG**GCCATC

此外,我不确定这是否是一个问题,但你可以产生更长的palidrome序列,只是通过这个解决方案转移到间隔区:

 Assuming length 2 – 4, spacer= 2 – 4 (X's are unintresting bits)
 ACACAXXTGTGT => ACAC**AXXT**GTGT