My file looks like the one below, and I am trying to do the numerical profiling described in the image.
>File Sample
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
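I have to take every substring of length 2, map it to its value under each of 33 different properties, and then sum those values over a window of size 5.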
my %temp = (
    aCount => {
        aa => 2,
    },
    cCount => {
        aa => 0,
    },
);
My current implementation looks like this:
while (<FILE>) {
    my $line = $_;
    chomp $line;
    while ($line=~/(.{2})/og) {
        $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {
            push @{$temp{aCount_array}},$temp{aCount}{$subStr};
            if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {
                my $sum = eval (join('+',@{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                #Similar approach has been taken to other 33 rules
            }
        }
        if (exists $temp{cCount}{$subStr}) {
            #similar approach
        }
        $line =~s/.{1}//og;
    }
}
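Is there any other way to speed up the whole process?
Answer 0 (score: 0)
Regular expressions are great, but they can be overkill when all you need is fixed-width substrings. One alternative is substr: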
my $len = length($line);
for (my $i = 0; $i < $len; $i += 2) {
    my $subStr = substr($line, $i, 2);
    ...
}
or unpack:
foreach my $subStr (unpack "(A2)*", $line) {
    ...
}
I don't know how much faster these would be than the regular expression, but I know how I would find out.
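One way to find out is probably the core Benchmark module. The following is only a rough sketch, not a definitive measurement: the helper names by_regex, by_substr and by_unpack are made up for illustration, and the input is just the first sample line from the question repeated 100 times, which is an assumption about what realistic data looks like.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test input: the first sample line from the question, repeated.
my $line = 'attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg' x 100;

# Each helper (hypothetical names) returns the non-overlapping
# 2-character substrings of its argument.
sub by_regex {
    my ($s) = @_;
    my @out;
    while ($s =~ /(.{2})/g) { push @out, $1 }
    return @out;
}

sub by_substr {
    my ($s) = @_;
    my @out;
    my $len = length $s;
    for (my $i = 0; $i + 2 <= $len; $i += 2) {
        push @out, substr($s, $i, 2);
    }
    return @out;
}

sub by_unpack {
    my ($s) = @_;
    return unpack '(A2)*', $s;
}

# A negative count asks Benchmark to run each variant for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    regex  => sub { my @p = by_regex($line) },
    substr => sub { my @p = by_substr($line) },
    unpack => sub { my @p = by_unpack($line) },
});
```

The table that cmpthese prints gives the rate of each variant and the relative differences between them. The numbers depend on the Perl version and the machine, so it is worth running this on the real input, and also timing the per-window work (the eval of a joined string in the question), which may well cost more than the substring extraction itself.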