Question

我想在Perl中实现这个算法。让我们接受：

DNA1 = GACTAGGC
DNA2 = AGCTAGGA

enter image description here

DNA1的第一个元素是G，我们会发现DNA2上是否有G并用点指向它。我们继续它直到结束，因此图像显示evey相同的元素交叉点作为点。

下一步是：连接点。要指向点，首先应该在一个小方块的左上角，另一个应该在右下方（我的意思是线应该有135度）如果严格性为2，则意味着拒绝从2发生的线和少于2个点（这意味着如果严格性为3，则图像上只有一条线）。

最后一步是：wordcount。如果wordcount为1（图像中为1），则表示逐个比较元素。如果它是3则表示将它们中的3个进行比较。您可以编写wordcount为1的程序，因为它始终为1.

我搜索过它，这就是我所拥有的：

$infile1 = "DNA1.txt";
$infile2 = "DNA2.txt";

$outfile = "plot.txt";
$wordsize=0;
$stringency=0;

open inf, $infile1 or die "STOP! File $infile1 not found.\n";
$sequence1=<inf>;
chomp $sequence1;
@seq1=split //,$sequence1;
close inf;

open inf, $infile2 or die "STOP! File $infile2 not found.\n";
$sequence2=<inf>;
chomp $sequence2;
@seq2=split //,$sequence2;
close inf;

$Lseq1=$#seq1+1;
$Lseq2=$#seq2+1;

open ouf, ">$outfile";

for ($i=0;$i<$Lseq1;$i++){
print ouf "\n";
for ($j=0;$j<$Lseq2;$j++){
  $match=0;
  for ($w=0;$w<=$wordsize;$w++){
    if($seq1[$i+$w] eq $seq2[$j+$w]){
      $match++;
    }
  }
  if($match > $stringency){
     print ouf "1";
  }
  else{
     print ouf "0";
  }
}
}

你能检查一下错误吗？如何在Perl中用更少的代码优化我的代码？

PS：我认为每次接受$ wordsize等于$ stringency是可以的。

编辑1：我编辑了我的代码，它只适用于放置点。

编辑2：算法就是这样：

qseq, sseq = sequences
win = number of elements to compare for each point
Strig = number of matches required for a point

for each q in qseq:
  for each s in sseq:
    if CompareWindow(qseq[q:q+win], s[s:s+win], strig):
      AddDot(q, s)

编辑3：以下是更好的算法建议：

osl.iu.edu/~chemuell/projects/bioinf/dotplot.ppt

欢迎根据更好的算法改进代码的任何想法。

Answer 1

首先，最里面的for循环是完全没必要的。摆脱它会加速你的代码。

其次，除了0之外，你的代码不适用于$ stringency。

修正：

use strict;
use warnings;

my $s1 = 'GACTAGGC';
my $s2 = 'AGCTAGGA';
my $stringency = 0;

my @s1 = split //, $s1;
my @s2 = split //, $s2;
my @L;
for my $i (0..$#s1) {
   for my $j (0..$#s2) {
      if ($s1[$i] ne $s2[$j]) {
         $L[$i][$j] = 0;
      } elsif ($i == 0 || $j == 0) {
         $L[$i][$j] = 1;
      } else {
         $L[$i][$j] = $L[$i-1][$j-1] + 1;
      }

      print $L[$i][$j] <= $stringency ? "0" : "1";
   }
   print("\n");
}

现在我们有了一个高效的算法，我们可以调整实现。

use strict;
use warnings;

my $s1 = 'GACTAGGC';
my $s2 = 'AGCTAGGA';
my $stringency = 0;

my @s1 = split //, $s1;
my @s2 = split //, $s2;
my @L = (0) x @s2;
for my $i (0..$#s1) {
   for my $j (0..$#s2) {
      if ($s1[$i] eq $s2[$j]) {
         ++$L[$j];
      } else {
         $L[$j] = 0;
      }

      print $L[$j] <= $stringency ? "0" : "1";
   }

   print("\n");
   pop @L;
   unshift @L, 0;
}

如果您想更好地了解正在发生的事情，请替换

print $L[$j] <= $stringency ? "0" : "1";

与

print $L[$j];

你会得到像

这样的东西

顺便说一句，尝试实现的内容与找到longest common substring非常相似。

更新要从文件中获取$s1和$s2，一次一行，

open(my $fh1, '<', ...) or die(...);
open(my $fh2, '<', ...) or die(...);

for (;;) {
    my $s1 = <$fh1>;
    my $s2 = <$fh2>;

    die("Files have different length\n")
        if defined($s1) && !defined($s2)
        || !defined($s1) && defined($s2);

    last if !defined(($s1);

    chomp($s1);
    chomp($s2);

    ... code to generate graph ...
}

在Perl中使用较少的代码行实现此算法

1 个答案: