Question

My file looks like the sample below, and I am trying to do the numeric profiling described in this image:

[image: numeric profiling methodology]

 >File Sample
 attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
 aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

I have to take every substring of length 2, map it to a value for each of 33 different properties, and then sum those values over a window of size 5.

    my %temp = (
        aCount => {
            aa => 2,
        },
        cCount => {
            aa => 0,
        },
    );
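To make the description concrete, here is a minimal sketch of the computation for a single property. It rests on assumptions that are not stated above: the table `%value` and its numbers are invented, the name `window_sums` is hypothetical, and pairs are taken as overlapping (moving one character at a time).

```perl
use strict;
use warnings;

# Hypothetical property table; the values are made up for illustration.
my %value = (aa => 2, at => 1, tt => 0, ta => 1);

# Return the sum of property values over every window of $window
# consecutive 2-character substrings of $seq (overlapping, step 1).
sub window_sums {
    my ($seq, $table, $window) = @_;
    my (@buf, @sums);
    my $sum = 0;
    for my $i (0 .. length($seq) - 2) {
        my $v = $table->{ substr($seq, $i, 2) };
        next unless defined $v;     # pairs without an entry are skipped
        push @buf, $v;
        $sum += $v;
        if (@buf == $window) {
            push @sums, $sum;
            $sum -= shift @buf;     # slide the window: drop the oldest value
        }
    }
    return @sums;
}

print join(',', window_sums('aattta', \%value, 5)), "\n";   # prints 4
```

For `aattta` the pairs are `aa`, `at`, `tt`, `tt`, `ta`, so the single window of 5 sums to 2 + 1 + 0 + 0 + 1 = 4.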

My current implementation looks like this:

   while (<FILE>) {
     my $line = $_;
     chomp $line;

     while ($line=~/(.{2})/og) {
        my $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {

          push @{$temp{aCount_array}},$temp{aCount}{$subStr};

          if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {

                my $sum = eval (join('+',@{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                #Similar approach has been taken to other 33 rules
          }

        }

        if (exists $temp{cCount}{$subStr}) {
             #similar approach 
        }

        $line =~ s/.//;   # remove only the first character (with /g this would empty the line)
     }
   }

Is there any other approach that would speed up the whole process?

Answer 1

Regular expressions are great, but they can be overkill when all you need is fixed-width substrings. One alternative is substr:

my $len = length($line);
for (my $i = 0; $i < $len; $i += 2) {
   my $subStr = substr($line, $i, 2);
   ...
}

or unpack:

foreach my $subStr (unpack "(A2)*", $line) {
   ...
}

I don't know how much faster these will be than the regular expression, but I know how I would find out.
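One possible way to measure this is Perl's core Benchmark module. The sketch below is only an illustration: the input string and the sub names `by_regex`, `by_substr` and `by_unpack` are made up, and each sub just collects non-overlapping 2-character chunks without doing any of the profiling work.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test input: 10,000 characters of repeated sequence.
my $line = 'acgt' x 2500;

# Each sub splits $line into non-overlapping 2-character chunks.
sub by_regex {
    my @pairs;
    while ($line =~ /(.{2})/g) { push @pairs, $1 }
    return \@pairs;
}

sub by_substr {
    my @pairs;
    my $len = length($line);
    for (my $i = 0; $i < $len; $i += 2) {
        push @pairs, substr($line, $i, 2);
    }
    return \@pairs;
}

sub by_unpack {
    return [ unpack '(A2)*', $line ];
}

# A negative count runs each candidate for at least that many CPU seconds,
# then prints a table of rates and relative speeds.
cmpthese(-1, {
    regex  => \&by_regex,
    substr => \&by_substr,
    unpack => \&by_unpack,
});
```

The numbers depend on the Perl version and the input, so it is worth running this on a line taken from the real file.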
