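I want to sum a column over the rows that match each window. I tried awk, but it is too slow.
For example, I have the following two windows: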
chr1-100-1000
chr1-1500-3000
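For these two windows I found the following matches. I want to sum the seventh column (index 6 when counting from zero): the rows whose last column is 1 give the numerator, and all rows together give the denominator.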
chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0
The result I want is:
chr1 100 1000 6/11
chr1 1500 3000 9/10
I tried doing this in Perl with a while loop, but because I have millions of entries it is very slow. Here is what I tried:
while (my $line = <IN>){
chomp $line;
my ($chrV,$start,$end) = split("-",$line);
my $total_mcTotal = `awk '{if (\$2 == $start && \$3 == $end) print \$8}' chr$chr\_intersect_temp | awk \'{sumT+=\$1} END {print sumT}\'`;
chomp $total_mcTotal;
`awk '{if (\$2 == $start && \$3 == $end) print \$7}' chr$chr\_intersect_Meth_temp > temp_$chr`;
my $total_mcCount = `awk \'{sum+=\$1} END {print sum}\' temp_$chr`;
chomp $total_mcCount;
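Is there a faster solution?
Answer 0: (score: 3)
The following accumulates the sums in a hash in a single pass. It could be simplified further if you can guarantee that the data is sorted: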
use strict;
use warnings;

my @keys;   # windows in first-seen order
my %vals;   # window => {n => numerator, d => denominator}

while (<DATA>) {
    # Strip the first three fields (the window) off the line and keep them as the key
    s{(\S+\s+\S+\s+\S+)\s+}{} or warn("No key at line $.: $_") and next;
    my $key = $1;
    my @data = split;
    if (!$vals{$key}) {
        push @keys, $key;
        $vals{$key} = {n => 0, d => 0}; # Ensure n gets initialized
    }
    $vals{$key}{d} += $data[3];
    $vals{$key}{n} += $data[3] if $data[4];
}
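for (@keys) {
    # Print the sums as "n/d", as requested in the question
    printf "%s %d/%d\n", $_, $vals{$_}{n}, $vals{$_}{d};
    # Alternative: print the ratio as a decimal instead
    # my $fraction = $vals{$_}{d}
    #     ? sprintf("%.02f", $vals{$_}{n}/$vals{$_}{d})
    #     : 'NaN';
    # print "$_ $fraction\n";
}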
__DATA__
chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0
Output:
chr1 100 1000 6/11
chr1 1500 3000 9/10
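The remark about sorted data can be made concrete. The sketch below is in awk rather than Perl and assumes that all rows for one window are adjacent in the input. `sum_sorted_windows` is a hypothetical helper name, and the column positions (7 for the value, 8 for the flag) are taken from the sample data. Because it keeps only the running sums for the current window, memory use stays constant however many windows there are.

```shell
# Hypothetical helper: sum_sorted_windows FILE
# Assumption: rows belonging to the same window are adjacent (sorted input).
sum_sorted_windows() {
    awk '
        {
            key = $1 " " $2 " " $3
            if (NR > 1 && key != prev) {
                # The window changed: print the finished one and reset the sums
                printf "%s %d/%d\n", prev, n, d
                n = 0; d = 0
            }
            prev = key
            d += $7                  # denominator: every matching row
            if ($8 == 1) n += $7     # numerator: only rows flagged 1
        }
        END { if (NR > 0) printf "%s %d/%d\n", prev, n, d }
    ' "$1"
}
```

If the input is not sorted, a window that reappears later would be printed twice with partial sums, so the hash-based version above remains the safer default.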
Edit:
Or, without worrying about the exact spacing of the key:
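while (<DATA>) {
    my @data = split;                   # split on any run of whitespace
    my $key = join ' ', @data[0..2];    # window: chromosome, start, end
    if (!$vals{$key}) {
        push @keys, $key;
        $vals{$key} = {n => 0, d => 0}; # Ensure n gets initialized
    }
    $vals{$key}{d} += $data[6];
    $vals{$key}{n} += $data[6] if $data[7];
}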
Answer 1: (score: 0)
You can do this with awk alone. Here file2 holds the matches and file1 holds the windows:
awk 'NR==FNR{str=$1 FS $2 FS $3;p[str FS $NF]+=$(NF-1);next}
{ str=$1 OFS $2 OFS $3;
print str,p[str OFS "1"] "/" p[str OFS "1"]+p[str OFS "0"]
}' file2 FS="-" file1
chr1 100 1000 6/11
chr1 1500 3000 9/10
str=$1 FS $2 FS $3;p[str FS $NF]+=$(NF-1)
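builds an associative array keyed on four fields (the three window columns plus the last column) and sums $(NF-1) for each key.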
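If the separate window list is not needed, the same keyed sum can be done in one pass over the matches file alone. This is a minimal sketch, not the answer's own code: `sum_windows` is a hypothetical helper name, and it assumes the flag is the last field and the value is the one before it, as in the sample data. Windows are printed in the order they first appear.

```shell
# Hypothetical helper: sum_windows FILE
# Assumption: last field is the 0/1 flag, the field before it is the value to sum.
sum_windows() {
    awk '
        {
            key = $1 " " $2 " " $3
            if (!(key in d)) order[++count] = key   # remember first-seen order
            d[key] += $(NF-1)                       # denominator: every row
            if ($NF == 1) n[key] += $(NF-1)         # numerator: rows flagged 1
        }
        END {
            for (i = 1; i <= count; i++)
                printf "%s %d/%d\n", order[i], n[order[i]], d[order[i]]
        }
    ' "$1"
}
```

Unlike the two-file command, this does not report windows that have no matching rows, since it never reads the window list.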