I have a long DNA sequence and need to find regions that consist of two palindromic sequences flanking a spacer sequence.
The input is:
cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa
Here is my code:
use strict;
use warnings;
my $input= $ARGV[0];
chomp $input;
open (my $fh_in, "<", $input) or die "Cannot open file $input $!";
my $dna= <$fh_in>;
chomp $dna;
#######################################################################################
if ($dna=~ /[^ACGT]/gi) {
print "This is not a valid DNA sequence, it has unknown base(s)\n";
}
$dna=~ tr/[acgt]/[ACGT]/;
######################################################################################
print "Minimum length of palindromic sequence?\n";
my $min= <STDIN>;
chomp $min;
print "Maximum length of palindromic sequence?\n";
my $max= <STDIN>;
chomp $max;
print "Minimum length of spacer region?\n";
my $min_spacer= <STDIN>;
chomp $min_spacer;
print "Maximum length of spacer region?\n";
my $max_spacer= <STDIN>;
chomp $max_spacer;
######################################################################################
my $dna_length= length($dna);
my ($length , $offset , $string_1 , $string_2);
for ($offset= 0 ; $offset <= $dna_length-$max-$max-$max_spacer ; $offset++) {
for ($length= $min ; $length <= $max ; $length++) {
$string_1= substr ($dna, $offset, $length);
$string_2= reverse $string_1;
$string_2=~ tr/[ACGT]/[TGCA]/;
if ($dna=~ /(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/) {
print "IR starts at $offset => $2***$3***$4\n$1\n\n";
}
}
}
With the parameters $min = 6, $max = 12, $min_spacer = 4 and $max_spacer = 12, the output I get is:
IR starts at 26 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA
IR starts at 27 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 118 => CGGATG***GCCAAGGC***CATCCG
CGGATGGCCAAGGCCATCCG
IR starts at 118 => CGGATGG***CCAAGG***CCATCCG
CGGATGGCCAAGGCCATCCG
IR starts at 118 => CGGATGGC***CAAG***GCCATCCG
CGGATGGCCAAGGCCATCCG
IR starts at 119 => GGATGG***CCAAGG***CCATCC
GGATGGCCAAGGCCATCC
IR starts at 119 => GGATGGC***CAAG***GCCATCC
GGATGGCCAAGGCCATCC
IR starts at 120 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC
IR starts at 136 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 164 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 252 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 254 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 254 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT
IR starts at 255 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA
IR starts at 256 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 258 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 274 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 276 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 304 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT
IR starts at 304 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT
IR starts at 305 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA
IR starts at 306 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG
IR starts at 314 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC
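However, when I check the region of my first hit in the input (the stretch tcgatcgatgctagcggcgcgatcga), the offset of this hit does not appear to be position 26. Can anyone tell me what is wrong with my code? Thanks.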
Answer 0 (score: 1)
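Your problem is that your regex looks for the palindrome anywhere in the sequence, not just at the position of the offset. "TCGATCG" is the substring at offset 26, so the pattern built from it matches wherever it fits further along the sequence, while the script still prints the loop offset 26. You need to add some position information to the regex. Try something like
/^[ACGT]{$offset}(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/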
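The anchoring idea can be tried in isolation with a small script. This is only a sketch: the helper names `revcomp` and `find_inverted_repeats` and the short test sequence are made up for illustration, and the input is assumed to be upper-case and to contain only A, C, G and T.

```perl
use strict;
use warnings;

# Reverse complement of an upper-case DNA string.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Hypothetical helper: returns one [offset, left arm, spacer, right arm]
# array reference per hit.
sub find_inverted_repeats {
    my ($dna, $min, $max, $min_spacer, $max_spacer) = @_;
    my @hits;
    for my $offset (0 .. length($dna) - 2 * $min - $min_spacer) {
        for my $len ($min .. $max) {
            my $arm = substr($dna, $offset, $len);
            last if length($arm) < $len;    # ran off the end of the sequence
            my $rc = revcomp($arm);
            # ^[ACGT]{$offset} consumes exactly $offset bases, so the left arm
            # can only match at $offset and nowhere else.
            if ($dna =~ /^[ACGT]{$offset}($arm)([ACGT]{$min_spacer,$max_spacer})($rc)/) {
                push @hits, [$offset, $1, $2, $3];
            }
        }
    }
    return @hits;
}

my $dna = uc 'ttttgatggccaaggccatctttt';
for my $hit (find_inverted_repeats($dna, 6, 6, 4, 4)) {
    printf "IR starts at %d => %s***%s***%s\n", @$hit;
}
```

Perl caps the counts allowed in a `{n,m}` quantifier at a build-time limit, so for very long sequences it may be safer to anchor the match differently, for example by setting `pos($dna) = $offset` and starting the pattern with `\G`.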
Answer 1 (score: 1)
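Here is a solution that uses the experimental (??{}) feature, which has long been announced as subject to change but has not changed so far.
How it works: the regex calls the subroutine convert, which turns the first capture group into the pattern required for the matching right-hand arm (its reverse complement). Everything else (backtracking and so on) is handled by the regex engine. Unfortunately, interpolating the variables into the quantifier lengths did not sit well with the regex parser, so I had to build the pattern as a string. Avoid doing that if you can.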
use warnings;
use strict;
use 5.010; # for the // (defined-or) operator
use re 'eval'; # needed, because of (??{})
my %c=
(min_pali => (shift) // 6,
max_pali => (shift) // 12,
min_spacer => (shift) // 4,
max_spacer => (shift) // 12,
);
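# Built as a string so that the length limits can be interpolated into the quantifiers.
my $re1 = "(.{$c{min_pali},$c{max_pali}})(.{$c{min_spacer},$c{max_spacer}})(??{convert})";
# Declared once, outside the loop, so that convert() and the printf below share the same variable.
my $converted;
# Called from (??{...}): returns the reverse complement of group 1 as the pattern to match next.
sub convert {
    my $var = reverse $1;
    $var =~ tr{ACGT}{TGCA};
    $converted = $var;
}
while (<DATA>) {
    chomp;
    $_ = uc $_;
    while (/$re1/g) {
        printf "%3d => %s**%s**%s\n", $-[0], $1, $2, $converted;
        pos = $-[0] + 1; # start the next match one character after the last match start
    }
}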
__DATA__
cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa
Output:
118 => CGGATG**GCCAAGGC**CATCCG
119 => GGATGG**CCAAGG**CCATCC
120 => GATGGC**CAAG**GCCATC
254 => ATCGATCG**ATGCTAGCGGCG**CGATCGAT
255 => TCGATCG**ATGCTAGCGGCG**CGATCGA
256 => CGATCG**ATGCTAGCGGCG**CGATCG
258 => ATCGAT**GCTAGCGGCGCG**ATCGAT
314 => GATGGC**CAAG**GCCATC
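Also, I'm not sure whether this is a problem, but with this solution a longer palindromic sequence can be reported as a shorter one, with its inner bases shifted into the spacer:
Assuming arm length 2 to 4 and spacer length 2 to 4 (the X's are uninteresting bases)
ACACAXXTGTGT => ACAC**AXXT**GTGT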
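If that shifting is unwanted, one possible post-processing step is to move complementary bases from the two ends of the spacer back into the arms. The sketch below is an assumption, not part of the solution above: `extend_arms` is a hypothetical helper, it works on an already reported hit, it ignores the maximum arm length, and it uses `GG` in place of the uninteresting `XX` bases.

```perl
use strict;
use warnings;

my %pair = (A => 'T', T => 'A', C => 'G', G => 'C');

# Hypothetical helper: grow both arms inward while the two ends of the spacer
# are complementary and the spacer stays at least $min_spacer bases long.
sub extend_arms {
    my ($left, $spacer, $right, $min_spacer) = @_;
    while (length($spacer) >= $min_spacer + 2) {
        my $first = substr($spacer, 0, 1);
        my $last  = substr($spacer, -1);
        last unless ($pair{$first} // '') eq $last;
        $left  .= $first;                  # spacer start joins the left arm
        $right  = $last . $right;          # spacer end joins the right arm
        $spacer = substr($spacer, 1, -1);  # drop both ends from the spacer
    }
    return ($left, $spacer, $right);
}

print join('**', extend_arms('ACAC', 'AGGT', 'GTGT', 2)), "\n";
```

With a minimum spacer of 2, the hit ACAC**AGGT**GTGT becomes ACACA**GG**TGTGT; with a minimum spacer of 4 it is left unchanged because the spacer cannot shrink any further.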