How to calculate per-window column counts faster than a Perl while loop

Time: 2014-03-21 21:49:17

Tags: perl awk

I want to sum a column over the rows that match each window. I tried awk, but it is too slow.

For example, I have the following two windows:

chr1-100-1000
chr1-1500-3000

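For these two windows I found the following matches, and I want to sum column 7 (index 6 when counting from zero), split by whether the last column is 1 or 0.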

chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0

The result I want, as (sum over rows whose last column is 1)/(sum over all rows), is:

chr1 100 1000 6/11
chr1 1500 3000 9/10

I tried doing this in Perl with a while loop, but because I have millions of entries it is very slow. Here is what I tried:

while (my $line = <IN>){
    chomp $line;
    my ($chrV,$start,$end) = split("-",$line);

    my $total_mcTotal = `awk '{if (\$2 == $start && \$3 == $end) print \$8}' chr$chr\_intersect_temp | awk \'{sumT+=\$1} END {print sumT}\'`;
    chomp $total_mcTotal;

    `awk '{if (\$2 == $start && \$3 == $end) print \$7}' chr$chr\_intersect_Meth_temp > temp_$chr`;
    my $total_mcCount = `awk \'{sum+=\$1} END {print sum}\' temp_$chr`;
    chomp $total_mcCount;

Is there a faster solution?

2 Answers:

Answer 0: (score: 3)

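The following reads the matches once and accumulates the sums per window. It could be simplified further if you can guarantee that the data is ordered: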

use strict;
use warnings;

my @keys;
my %vals;

while (<DATA>) {
    s{(\S+\s+\S+\s+\S+)\s+}{} or warn("No key at line $.: $_") and next;
    my $key = $1;
    my @data = split;
    if (!$vals{$key}) {
        push @keys, $key;
        $vals{$key} = {n => 0, d => 0}; # Ensure n gets initialized
    }
    $vals{$key}{d} += $data[3];
    $vals{$key}{n} += $data[3] if $data[4];
}

for (@keys) {
    # printf "%s %d/%d\n", $_, $vals{$_}{n}, $vals{$_}{d};
    my $fraction = $vals{$_}{d}
        ? sprintf("%.02f", $vals{$_}{n}/$vals{$_}{d})
        : 'NaN';
    print "$_ $fraction\n";
}

__DATA__
chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0

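Output, using the commented-out printf line (the sprintf version as written prints 0.55 and 0.90 instead):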

chr1 100 1000 6/11
chr1 1500 3000 9/10

Edit

Or, if you don't care about preserving the exact spacing of the key:

while (<DATA>) {
    my @data = split;
    my $key = join ' ', @data[0..2];
    if (!$vals{$key}) {
        push @keys, $key;
        $vals{$key} = {n => 0, d => 0}; # Ensure n gets initialized
    }
    $vals{$key}{d} += $data[6];
    $vals{$key}{n} += $data[6] if $data[7];
}
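
As one possible reading of the "ordered data" simplification mentioned above: if all rows for a window are adjacent, no hash is needed, because each window's totals can be printed as soon as the key changes. Below is a minimal sketch of that idea in awk (the question's other tag), run from a shell. The temporary file and the variable names `matches`, `streamed`, `key`, `prev`, `n` and `d` are assumptions made for illustration, and the input is assumed to be grouped by its first three columns.

```shell
# Sample rows from the question, written to a temporary file (illustrative only).
matches=$(mktemp)
cat > "$matches" <<'EOF'
chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0
EOF

# Assumes rows for the same window are adjacent (grouped input).
streamed=$(awk '
    BEGIN { n = 0; d = 0 }
    { key = $1 " " $2 " " $3 }
    # Key changed: emit the finished window and reset the running sums.
    NR > 1 && key != prev { print prev, n "/" d; n = 0; d = 0 }
    # Column 7 always goes into the total; into n only when column 8 is non-zero.
    { d += $7; if ($8) n += $7; prev = key }
    # Emit the last window, if there was any input at all.
    END { if (NR) print prev, n "/" d }
' "$matches")

printf '%s\n' "$streamed"
rm -f "$matches"
```

This keeps only two running sums in memory regardless of input size. If the rows are not grouped, the same window is printed more than once, so the hash-based versions above remain the safer choice.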

Answer 1: (score: 0)

You can do this with awk alone.

awk 'NR==FNR{str=$1 FS $2 FS $3;p[str FS $NF]+=$(NF-1);next}
{ str=$1 OFS $2 OFS $3;
  print str,p[str OFS "1"] "/" p[str OFS "1"]+p[str OFS "0"]
}' file2 FS="-" file1

chr1 100 1000 6/11
chr1 1500 3000 9/10

Explanation

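  • str=$1 FS $2 FS $3;p[str FS $NF]+=$(NF-1) builds an associative array with a four-part key (the three window columns plus the final 0/1 flag) and sums the values of $(NF-1) under each key.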
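
To make the explanation above concrete, here is a small sketch that runs only the first-pass accumulation on the question's sample rows and dumps the resulting array. The temporary file, the variable name `sums`, and the `->` separator are assumptions added for illustration. The output is piped through `sort` because awk does not guarantee any order for `for (k in p)`.

```shell
# Sample rows from the question, standing in for file2 (illustrative only).
file2=$(mktemp)
cat > "$file2" <<'EOF'
chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0
EOF

# Same accumulation as the first block of the answer, followed by a dump of
# every key (window columns plus flag) and the sum stored under it.
sums=$(awk '
    { str = $1 FS $2 FS $3; p[str FS $NF] += $(NF-1) }
    END { for (k in p) print k, "->", p[k] }
' "$file2" | LC_ALL=C sort)

printf '%s\n' "$sums"
rm -f "$file2"
```

For the sample rows this prints four lines: `chr1 100 1000 0 -> 5`, `chr1 100 1000 1 -> 6`, `chr1 1500 3000 0 -> 1` and `chr1 1500 3000 1 -> 9`. The second block of the answer then combines the flag-1 and flag-0 entries of each window into 6/11 and 9/10.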