Sum one column of a file, grouped by two other columns

Time: 2018-04-27 05:41:37

Tags: bash perl awk hash

I want to sum the rows that share the same string before the first "_" character. I tried R, but the file is too large.

in

URS0000001D42-antisense_ATTTCGGTTGGGGAA 208
URS0000001D42-antisense_CATGCTCATAAGGAA 24
URS0000003804-lncRNA_GAGATCCTGGGTTTT    6
URS0000003CBA-antisense_CTGGGCTAGTGAACGCGGCGAAGT        14
URS0000003F61-antisense_AAAGTGCACTTGGACG        55
URS0000003F61-antisense_AAAGTGCACTTGGACGAA      4

out

URS0000001D42-antisense 232
URS0000003804-lncRNA 6
URS0000003CBA-antisense 14
URS0000003F61-antisense 59

3 answers:

Answer 0 (score: 1)

Using awk

awk '{a[$1]+=$NF}END{for (i in a){print i,a[i]}}' FS='_| ' file

Result

URS0000003804-lncRNA 6
URS0000001D42-antisense 232
URS0000003CBA-antisense 14
URS0000003F61-antisense 59
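
The one-liner above splits each line on "_" or a single space, so `$1` is the prefix and `$NF` is the count. As a minimal sketch, assuming the real file might be tab-separated and that a stable output order is wanted, the separator can be widened and the result piped through `sort`. The function name `sum_by_prefix` is hypothetical and used only for illustration.

```shell
# Hypothetical helper: sums the last column per prefix before the first "_".
sum_by_prefix() {
  # FS matches "_" or any run of spaces/tabs, so $1 is the prefix and
  # $NF is the count whether the columns are space- or tab-separated.
  # NF > 1 skips blank lines.
  awk -F'_|[ \t]+' 'NF > 1 { sum[$1] += $NF }
       END { for (k in sum) print k, sum[k] }' "$@" | LC_ALL=C sort
}
```

Usage would be `sum_by_prefix file`. The `for (k in sum)` loop visits keys in an unspecified order, which is why the output is sorted afterwards.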

Answer 1 (score: 1)

Using a Perl hash:

Script:

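#!/usr/bin/env perl
use strict;
use warnings;

my %hash;

while (my $line = <>) {
  # Skip blank or malformed lines instead of stopping at the first one.
  next unless $line =~ /^([^_]+)_\S+\s+(\d+)/;
  # $1 is the prefix before the first "_", $2 is the trailing count.
  $hash{$1} += $2;
}

# Hash iteration order is arbitrary, so the output is unordered.
while (my ($k, $v) = each %hash) {
  print "$k\t$v\n";
}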

Invocation:

$ script.pl < file
URS0000003CBA-antisense	14
URS0000003F61-antisense	59
URS0000003804-lncRNA	6
URS0000001D42-antisense	232
$

It can also be done more concisely. ;-)

Here's another question with a very similar task and many answers.

Answer 2 (score: 0)

Here is a Perl solution

use strict;
use warnings 'all';

my %data;

while ( <DATA> ) {
    my ( $f1, $f2, $seq, $n ) = m/[^-_\s]+/g;
    $data{$f1}{$f2} += $n;
}

for my $f1 ( keys %data ) {

    for my $f2 ( keys %{ $data{$f1} } ) {
        printf "%s-%s %d\n", $f1, $f2, $data{$f1}{$f2};
    }
}

__DATA__
URS0000001D42-antisense_ATTTCGGTTGGGGAA 208
URS0000001D42-antisense_CATGCTCATAAGGAA 24
URS0000003804-lncRNA_GAGATCCTGGGTTTT    6
URS0000003CBA-antisense_CTGGGCTAGTGAACGCGGCGAAGT        14
URS0000003F61-antisense_AAAGTGCACTTGGACG        55
URS0000003F61-antisense_AAAGTGCACTTGGACGAA      4

Output

URS0000003CBA-antisense 14
URS0000001D42-antisense 232
URS0000003804-lncRNA 6
URS0000003F61-antisense 59

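The output is unordered because Perl hashes have no inherent order. Keeping the output in the same order as the input data is a little harder, because for each hash you have to maintain an array that tracks the order in which its keys were created.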

use strict;
use warnings 'all';

my ( %data, @keys );

while ( <DATA> ) {

    my ( $f1, $f2, $seq, $n ) = /[^-_\s]+/g;

    push @keys, $f1 unless $data{$f1};

    my $h2 = $data{$f1} //= {};

    push @{ $h2->{''} }, $f2 unless $h2->{$f2};

    $h2->{$f2} += $n;
}

for my $f1 ( @keys ) {

    for my $f2 ( @{ $data{$f1}{''} } ) {
        printf "%s-%s %d\n", $f1, $f2, $data{$f1}{$f2};
    }
}

__DATA__
URS0000001D42-antisense_ATTTCGGTTGGGGAA 208
URS0000001D42-antisense_CATGCTCATAAGGAA 24
URS0000003804-lncRNA_GAGATCCTGGGTTTT    6
URS0000003CBA-antisense_CTGGGCTAGTGAACGCGGCGAAGT        14
URS0000003F61-antisense_AAAGTGCACTTGGACG        55
URS0000003F61-antisense_AAAGTGCACTTGGACGAA      4

Output

URS0000001D42-antisense 232
URS0000003804-lncRNA 6
URS0000003CBA-antisense 14
URS0000003F61-antisense 59
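
The same order-tracking idea (a side array that records keys as they are first seen) can also be sketched in awk. This is only an illustration under the assumption that columns are separated by spaces or tabs; the function name `sum_in_input_order` is hypothetical.

```shell
# Hypothetical helper: per-prefix totals, printed in first-seen order.
sum_in_input_order() {
  awk -F'_|[ \t]+' '
    NF > 1 {
      if (!($1 in sum)) order[++n] = $1   # remember first-seen order
      sum[$1] += $NF
    }
    END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
  ' "$@"
}
```

Because the keys are replayed from the `order` array rather than from a `for (k in sum)` loop, the output follows the order in which each prefix first appears in the input.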