I want to sum rows that share the same string before the first "_" sign. I tried R, but the file is too big.
in
URS0000001D42-antisense_ATTTCGGTTGGGGAA 208
URS0000001D42-antisense_CATGCTCATAAGGAA 24
URS0000003804-lncRNA_GAGATCCTGGGTTTT 6
URS0000003CBA-antisense_CTGGGCTAGTGAACGCGGCGAAGT 14
URS0000003F61-antisense_AAAGTGCACTTGGACG 55
URS0000003F61-antisense_AAAGTGCACTTGGACGAA 4
out
URS0000001D42-antisense 232
URS0000003804-lncRNA 6
URS0000003CBA-antisense 14
URS0000003F61-antisense 59
Answer 0: (score: 1)
Using awk:
awk '{a[$1]+=$NF}END{for (i in a){print i,a[i]}}' FS='_| ' file
Result
URS0000003804-lncRNA 6
URS0000001D42-antisense 232
URS0000003CBA-antisense 14
URS0000003F61-antisense 59
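The result above comes back in a different order from the expected output, because awk does not specify the traversal order of `for (i in a)`. A minimal sketch of one way to get a stable order, assuming plain byte-wise sorting of the output lines is acceptable; `sum_sorted` is a hypothetical helper name, not part of the answer:

```shell
# sum_sorted is a hypothetical helper name (an assumption for illustration).
# It reads the named files, or standard input when no file is given.
sum_sorted() {
    awk -F'[_ ]' '
        NF >= 2 { sum[$1] += $NF }   # $1: text before the first "_"; $NF: the count
        END     { for (k in sum) print k, sum[k] }
    ' "$@" | LC_ALL=C sort           # byte-wise sort for a reproducible order
}
```

Called as `sum_sorted file`, this only sorts the summed lines, so the sort works on one line per distinct prefix rather than on the whole input.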
Answer 1: (score: 1)
Using a Perl hash:
Script:
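#!/usr/bin/env perl
use strict;
use warnings;

my %hash;
while (my $line = <>) {
    # Key: text before the last "_" (greedy match); value: the trailing count
    my ($key, $value) = $line =~ /^(.+)_.+\s+(\d+)/ or next;
    $hash{$key} += $value;
}
while (my ($k, $v) = each %hash) {
    print "$k: $v\n";
}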
Invocation:
$ script.pl < file
URS0000003CBA-antisense: 14
URS0000003F61-antisense: 59
URS0000003804-lncRNA: 6
URS0000001D42-antisense: 232
$
It could also be made shorter. ;-)
And here's another question about a very similar task, with many answers.
Answer 2: (score: 0)
Here is a Perl solution
use strict;
use warnings 'all';

my %data;

while ( <DATA> ) {
    # Split on '-', '_' and whitespace: ID, type, sequence, count
    my ( $f1, $f2, $seq, $n ) = m/[^-_\s]+/g;
    $data{$f1}{$f2} += $n;
}

for my $f1 ( keys %data ) {
    for my $f2 ( keys %{ $data{$f1} } ) {
        printf "%s-%s %d\n", $f1, $f2, $data{$f1}{$f2};
    }
}
__DATA__
URS0000001D42-antisense_ATTTCGGTTGGGGAA 208
URS0000001D42-antisense_CATGCTCATAAGGAA 24
URS0000003804-lncRNA_GAGATCCTGGGTTTT 6
URS0000003CBA-antisense_CTGGGCTAGTGAACGCGGCGAAGT 14
URS0000003F61-antisense_AAAGTGCACTTGGACG 55
URS0000003F61-antisense_AAAGTGCACTTGGACGAA 4
output
URS0000003CBA-antisense 14
URS0000001D42-antisense 232
URS0000003804-lncRNA 6
URS0000003F61-antisense 59
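The output is unordered because Perl hashes have no inherent order. Keeping the output in the same order as the input data is a little harder, because you have to keep an array for each hash that records the order in which its keys were created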
use strict;
use warnings 'all';

my ( %data, @keys );

while ( <DATA> ) {
    my ( $f1, $f2, $seq, $n ) = /[^-_\s]+/g;

    # Record each first-level key the first time it appears
    push @keys, $f1 unless $data{$f1};

    # The '' entry of each inner hash holds its keys in insertion order
    my $h2 = $data{$f1} //= {};
    push @{ $h2->{''} }, $f2 unless exists $h2->{$f2};

    $h2->{$f2} += $n;
}

for my $f1 ( @keys ) {
    for my $f2 ( @{ $data{$f1}{''} } ) {
        printf "%s-%s %d\n", $f1, $f2, $data{$f1}{$f2};
    }
}
__DATA__
URS0000001D42-antisense_ATTTCGGTTGGGGAA 208
URS0000001D42-antisense_CATGCTCATAAGGAA 24
URS0000003804-lncRNA_GAGATCCTGGGTTTT 6
URS0000003CBA-antisense_CTGGGCTAGTGAACGCGGCGAAGT 14
URS0000003F61-antisense_AAAGTGCACTTGGACG 55
URS0000003F61-antisense_AAAGTGCACTTGGACGAA 4
output
URS0000001D42-antisense 232
URS0000003804-lncRNA 6
URS0000003CBA-antisense 14
URS0000003F61-antisense 59
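The same bookkeeping idea (a separate array that remembers the order in which keys were first seen) can also be sketched in awk, which may suit a file too large for R. This is only an illustration under assumptions: `sum_in_input_order` is a hypothetical helper name, the whole text before the first "_" is treated as one key, and fields are assumed to be separated by spaces or tabs with no trailing whitespace.

```shell
# sum_in_input_order is a hypothetical helper name (an assumption for illustration).
# It reads the named files, or standard input when no file is given.
sum_in_input_order() {
    awk -F'[_ \t]+' '
        NF < 2       { next }               # skip blank or malformed lines
        !($1 in sum) { order[++n] = $1 }    # remember each prefix on first sight
                     { sum[$1] += $NF }     # add the trailing count to its prefix
        END          { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
    ' "$@"
}
```

Called as `sum_in_input_order file`, it holds one array entry per distinct prefix, so memory use grows with the number of prefixes rather than with the number of input lines.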