I have two strings of equal length that I need to compare. I want to find the overlapping bases (.) and the internal gaps (*). Here is an example:
------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
      ................**.................
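Number of overlapping bases = 33. Number of internal gaps = 2.
Counting the overlap is no problem, but I am having trouble finding the internal gaps. Below is my code so far. It is very slow, and in principle I need to process millions of such pairs.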
#!/usr/bin/perl -w
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";
print "$s1\n";
print "$s2\n";
my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);
my $ovlp_basecount = 0;
my $internal_gap = 0;
foreach my $si ( 0 .. length($s1) - 1 ) {
    my $base1 = substr($s1,$si,1);
    my $base2 = substr($s2,$si,1);
    # Overlap
    if ( $base{$base1} && $base{$base2} ) {
        $ovlp_basecount++;
    }
    # Not sure how to compute internal gap
}
print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap \?\n";
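Please suggest how to find the internal gaps and the overlap efficiently.
Answer 0 (score: 3)
You can use a bitwise OR on the two strings to find the regions where one string has bases and the other has only gap characters. As a side effect, the characters that do not overlap are converted to lowercase, which makes the overlap very easy to pick out: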
#!/usr/bin/perl
use strict;
use warnings;
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";
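# Turn gaps into spaces (0x20) so that OR-ing a base with a gap lowercases it.
$s1 =~ tr/-/\x20/;
$s2 =~ tr/-/\x20/;
my $or = $s1 | $s2;    # string-wise bitwise OR
# A lowercase run flanked by uppercase bases is an internal gap.
(my $gap) = $or =~ m/^.*[ACTG]([actg]+)[ACTG].*$/;
# Uppercase positions are bases present in both strings.
(my $overlap = $or) =~ s/[^A-Z]//g;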
print "s1: '$s1'\n";
print "s2: '$s2'\n";
print "OR: '$or'\n";
printf "Gap: '%s' (%d)\n", $gap, length $gap;
printf "Overlap '%s' (%d)\n", $overlap, length $overlap;
This prints:
s1: '      ACTAAAAATACAAAAA  TTAGCCAGGCGTGGTGGCAC'
s2: '     TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG   '
OR: '     tACTAAAAATACAAAAAaaTTAGCCAGGWGTGGTGGcac'
Gap: 'aa' (2)
Overlap 'ACTAAAAATACAAAAATTAGCCAGGWGTGGTGG' (33)
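The lowercasing follows from ASCII: a space is 0x20, which is exactly the bit that separates an uppercase letter from its lowercase form. A minimal sketch of the three cases that occur in the OR string (the helper name `or_chars` is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: bitwise OR of two one-character strings.
sub or_chars {
    my ($x, $y) = @_;
    return $x | $y;    # string OR, because both operands are strings
}

print or_chars('A', ' '), "\n";   # base vs. gap: 0x41 | 0x20 = 0x61, prints "a"
print or_chars('G', 'G'), "\n";   # same base in both: unchanged, prints "G"
print or_chars('C', 'T'), "\n";   # mismatch: 0x43 | 0x54 = 0x57, prints "W"
```

This is why the mismatched position shows up as `W` in the output above while still counting as an overlapping base.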
More information on bitwise string operations:
http://teaching.idallen.com/cst8214/08w/notes/bit_operations.txt
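The regex in the script captures a single lowercase run, and its `[ACTG]` flanks would not match a gap that sits next to a mismatch such as `W`. For millions of pairs it may also be cheaper to count with `tr` than to build substrings. The following is only a sketch, assuming both strings have equal length and contain nothing but `A`, `C`, `G`, `T` and `-`; the function name `overlap_and_gaps` is made up here:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (overlap count, internal gap count).
# Assumes equal-length strings containing only A, C, G, T and '-'.
sub overlap_and_gaps {
    my ($s1, $s2) = @_;
    tr/-/ / for ($s1, $s2);           # gaps become spaces (0x20); copies only
    my $or = $s1 | $s2;               # string-wise bitwise OR

    # Uppercase = base in both strings; tr with an empty
    # replacement list counts without modifying.
    my $overlap = ($or =~ tr/A-Z//);

    # Internal gaps: lowercase characters between the first and
    # last overlapping base, summed over every run.
    my $internal = 0;
    if ($or =~ /([A-Z].*[A-Z])/) {
        my $core = $1;
        $internal = ($core =~ tr/a-z//);
    }
    return ($overlap, $internal);
}

my ($ov, $gaps) = overlap_and_gaps(
    "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC",
    "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---",
);
print "Overlap: $ov, internal gaps: $gaps\n";   # Overlap: 33, internal gaps: 2
```

Positions where both strings have a gap stay as spaces and are not counted either way; whether that is the desired behaviour depends on how an internal gap is defined for your data.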
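Answer 1 (score: 1)
Assuming the gaps in the two strings never overlap, you can solve this with a regular expression. Here is the answer for your s1: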
echo '------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC' | perl -ne '$s = 0; while (/[GTAC](-+)(?=[GTAC])/g) { $s += length($1); } print "$s\n";'
2
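The same idea can be written as a small script that handles both strings. This is a sketch under the same assumption that gaps in the two strings never overlap; the helper name `internal_gap_length` is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: total length of dash runs that have a base on both sides.
sub internal_gap_length {
    my ($seq) = @_;
    my $total = 0;
    # The lookahead leaves the right-hand base unconsumed, so it can
    # also serve as the left flank of the next gap.
    while ($seq =~ /[GTAC](-+)(?=[GTAC])/g) {
        $total += length $1;
    }
    return $total;
}

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";
printf "Internal gaps: %d\n",
    internal_gap_length($s1) + internal_gap_length($s2);   # Internal gaps: 2
```

Note that this looks at each string on its own: it does not check that the other string actually has bases at the gap positions.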