我找到了这个perl脚本但是我有太多的序列需要分析。我想知道是否可以对其进行优化?我在它上面推出了NYTProf,看到部分“计算匹配分数”,“计算差距分数”和“选择最佳分数”需要花费大量时间。我是否必须修改数据结构?谢谢你的帮助。
带有引用的perl脚本:
# Smith-Waterman Algorithm
# from this website http://etutorials.org/Misc/blast/Part+II+Theory/Chapter+3.+Sequence+Alignment/3.2+Local+Alignment+Smith-Waterman/
# Smith-Waterman Algorithm
# usage statement
die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;
# get sequences from command line
my ($seq1, $seq2) = @ARGV;
# scoring scheme
my $MATCH = 1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP = -1; # -1 for any gap
# initialization
my @matrix;
$matrix[0][0]{score} = 0;
$matrix[0][0]{pointer} = "none";
for(my $j = 1; $j <= length($seq1); $j++) {
$matrix[0][$j]{score} = 0;
$matrix[0][$j]{pointer} = "none";
}
for (my $i = 1; $i <= length($seq2); $i++) {
$matrix[$i][0]{score} = 0;
$matrix[$i][0]{pointer} = "none";
}
# fill
my $max_i = 0;
my $max_j = 0;
my $max_score = 0;
for(my $i = 1; $i <= length($seq2); $i++) {
for(my $j = 1; $j <= length($seq1); $j++) {
my ($diagonal_score, $left_score, $up_score);
# calculate match score
my $letter1 = substr($seq1, $j-1, 1);
my $letter2 = substr($seq2, $i-1, 1);
if ($letter1 eq $letter2) {
$diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
}
else {
$diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
}
# calculate gap scores
$up_score = $matrix[$i-1][$j]{score} + $GAP;
$left_score = $matrix[$i][$j-1]{score} + $GAP;
if ($diagonal_score <= 0 and $up_score <= 0 and $left_score <= 0) {
$matrix[$i][$j]{score} = 0;
$matrix[$i][$j]{pointer} = "none";
next; # terminate this iteration of the loop
}
# choose best score
if ($diagonal_score >= $up_score) {
if ($diagonal_score >= $left_score) {
$matrix[$i][$j]{score} = $diagonal_score;
$matrix[$i][$j]{pointer} = "diagonal";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
} else {
if ($up_score >= $left_score) {
$matrix[$i][$j]{score} = $up_score;
$matrix[$i][$j]{pointer} = "up";
}
else {
$matrix[$i][$j]{score} = $left_score;
$matrix[$i][$j]{pointer} = "left";
}
}
# set maximum score
if ($matrix[$i][$j]{score} > $max_score) {
$max_i = $i;
$max_j = $j;
$max_score = $matrix[$i][$j]{score};
}
}
}
# trace-back
my $align1 = "";
my $align2 = "";
my $j = $max_j;
my $i = $max_i;
while (1) {
last if $matrix[$i][$j]{pointer} eq "none";
if ($matrix[$i][$j]{pointer} eq "diagonal") {
$align1 .= substr($seq1, $j-1, 1);
$align2 .= substr($seq2, $i-1, 1);
$i--; $j--;
}
elsif ($matrix[$i][$j]{pointer} eq "left") {
$align1 .= substr($seq1, $j-1, 1);
$align2 .= "-";
$j--;
}
elsif ($matrix[$i][$j]{pointer} eq "up") {
$align1 .= "-";
$align2 .= substr($seq2, $i-1, 1);
$i--;
}
}
$align1 = reverse $align1;
$align2 = reverse $align2;
print "$align1\n";
print "$align2\n";
答案 0 :(得分:5)
您可以尝试避免一遍又一遍地做同样的事情。
不是从序列中删除单个字符,而是可以尝试在循环之前将字符序列拆分一次,而不是通过索引使用最可能更快的访问。
示例:
my $string = "Hello, how are you?";
my @chars = split //, $string; # Or: unpack 'a*', $string
print "Eighth char: $chars[7]\n";
my $letter2 = substr($seq2, $i-1, 1);
可以转到外部循环,因为j
永远不会在内循环中发生变化。
for(my $i = 1; $i <= length($seq2); $i++) {
my $letter2 = substr($seq2, $i-1, 1);
for(my $j = 1; $j <= length($seq1); $j++) {
避免使用速度较慢且复杂的C风格循环。
for my $i (1..length($seq2)) {
my $letter2 = substr($seq2, $i-1, 1);
for my $j (1..length($seq1)) {
使用整数作为pointer
的值,而不是字符串。您可以使用常量来保持其可读性。
use constant {
POINTER_NONE => 0,
POINTER_LEFT => 1,
...
};
预先计算$j-1
和$i-1
也可能带来一些小优势。
请注意您应该在每次更改之前和之前对代码进行分析,以查看速度是否会提高。
所有这些都是微小的改进。真正的问题是你有一个二次算法。
答案 1 :(得分:1)
真正的问题是你有一个二次算法。它本身就会很慢。然而,该算法是解决最长公共子串问题的动态方法的变体,你不能做得更好。
也就是说,这是一个具有以下功能的实现:
use strict;
use warnings;
use feature qw( say );
sub local_alignment {
my ($s, $t, %scheme) = @_;
my $empty = [];
my $MATCH = $scheme{ MATCH } // +1; # Must be >= 1 # /
my $MISMATCH = $scheme{ MISMATCH } // -1; # Must be <= 0 # /
my $GAP = $scheme{ GAP } // -1; # Must be <= 0 # /
my $m = my @s = unpack('(a)*', $s);
my $n = my @t = unpack('(a)*', $t);
my @best_score_at_k = ( 0 ) x (1+$m+$n);
my @best_paths_at_k = ( $empty ) x (1+$m+$n);
my $offset = 1+$m;
my $best_score = 0;
my @best_paths;
for my $i (0..$m-1) {
--$offset;
for my $j (0..$n-1) {
my $k = $j + $offset;
my $diag_score = $best_score_at_k[$k] + ( $s[$i] eq $t[$j] ? $MATCH : $MISMATCH );
my $up_score = $best_score_at_k[$k+1] + $GAP;
my $left_score = $best_score_at_k[$k-1] + $GAP;
if ($diag_score <= 0 && $up_score <= 0 && $left_score <= 0) {
$best_score_at_k[$k] = 0;
$best_paths_at_k[$k] = $empty;
next;
}
my $new_score = 1;
my @new_paths;
if ($diag_score >= $new_score) {
if ($diag_score > $new_score) {
$new_score = $diag_score;
@new_paths = ();
}
push @new_paths, map { $_ . "\3" } @{ $best_paths_at_k[$k] } ? @{ $best_paths_at_k[$k] } : pack('JJ', $i, $j);
}
if ($up_score >= $new_score) {
if ($up_score > $new_score) {
$new_score = $up_score;
@new_paths = ();
}
# @{ $best_paths_at_k[$k+1] } will never be empty because a gap will never start a sequence.
push @new_paths, map { $_ . "\1" } @{ $best_paths_at_k[$k+1] };
}
if ($left_score >= $new_score) {
if ($left_score > $new_score) {
$new_score = $left_score;
@new_paths = ();
}
# @{ $best_paths_at_k[$k+1] } will never be empty because a gap will never start a sequence.
push @new_paths, map { $_ . "\2" } @{ $best_paths_at_k[$k-1] };
}
$best_score_at_k[$k] = $new_score;
$best_paths_at_k[$k] = \@new_paths;
if ($new_score >= $best_score) {
if ($new_score > $best_score) {
$best_score = $new_score;
@best_paths = ();
}
push @best_paths, @new_paths;
}
}
}
return
map {
my ($I, $J, @path) = unpack 'JJC*', $_;
my ($align1, $align2);
my $i = $I-1;
my $j = $J-1;
for (@path) {
$align1 .= ( $_ & 1 ) ? $s[++$i] : '-';
$align2 .= ( $_ & 2 ) ? $t[++$j] : '-';
}
[ $I, $align1, $J, $align2 ]
}
@best_paths;
}
die "usage\n" if @ARGV != 2;
say sprintf '%2$s (at pos %1$s) / %4$s (at pos %3$s)', @$_ for local_alignment(@ARGV);
测试:
$ ./local_alignment COELACANTH PELICAN
ELACAN (at pos 2) / ELICAN (at pos 1)
$ ./local_alignment COELACANTH PELICANPELICAN
ELACAN (at pos 2) / ELICAN (at pos 1)
ELACAN (at pos 2) / ELICAN (at pos 8)
$ ./local_alignment ABCE ABCDE
ABC (at pos 0) / ABC (at pos 0)
ABC-E (at pos 0) / ABCDE (at pos 0)
$ ./local_alignment ABCDE ABDCE
AB (at pos 0) / AB (at pos 0)
AB-C (at pos 0) / ABDC (at pos 0)
ABCD (at pos 0) / AB-D (at pos 0)
AB-CDE (at pos 0) / ABDC-E (at pos 0)
ABCD-E (at pos 0) / AB-DCE (at pos 0)