I'm new to the command line. I have a long text file (samp.txt) containing space-delimited columns. Help with awk / sed / perl is appreciated.
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
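I want to count each unique Id (how many times each unique Id occurs), count column 6 (column 6 = Pass), and total up He (column 8) and Ho (column 9). I would like to get a result like this: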
Id CountId Countpass CountHe CountHO
cm|371443199 3 3 2 1
cm|367079424 2 2 0 2
Answer 0 (score: 2)
awk 'NR > 1 {ids[$1]++; pass[$1] = "?"; he[$1] += $8; ho[$1] += $9} END {OFS = "\t"; print "Id", "CountId", "Countpass", "CountHe", "CountHO"; for (id in ids) {print id, ids[id], pass[id], he[id], ho[id]}}' inputfile
Split across multiple lines:
awk 'NR > 1 {            # skip the header line
    ids[$1]++;
    pass[$1] = "?";      # not sure what you want here
    he[$1] += $8;
    ho[$1] += $9
}
END {
    OFS = "\t";
    print "Id", "CountId", "Countpass", "CountHe", "CountHO";
    for (id in ids) {    # note: iteration order is unspecified in awk
        print id, ids[id], pass[id], he[id], ho[id]
    }
}' inputfile
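The `"?"` placeholder can be filled in once "Countpass" is pinned down. The sketch below assumes it means the number of rows per Id whose sixth column starts with `Pass`, which is one reading of the question rather than something it confirms. The file name `samp_demo.txt` and the function name `summarize` are hypothetical, introduced only for illustration. It also keeps the ids in first-seen order, and it prints them as they appear in the input (`c|...`), not with the `cm|` prefix shown in the desired output.

```shell
# Hypothetical demo file holding the sample rows from the question
cat > samp_demo.txt <<'EOF'
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
EOF

# summarize FILE: per-Id row count, Pass count, and He/Ho totals
summarize() {
    awk '
    NR == 1 { next }                        # skip the header line
    {
        if (!($1 in ids)) order[++n] = $1   # remember first-seen order
        ids[$1]++
        if ($6 ~ /^Pass/) pass[$1]++        # assumption: count rows marked Pass
        he[$1] += $8
        ho[$1] += $9
    }
    END {
        OFS = "\t"
        print "Id", "CountId", "Countpass", "CountHe", "CountHO"
        for (i = 1; i <= n; i++) {
            id = order[i]
            print id, ids[id], pass[id] + 0, he[id] + 0, ho[id] + 0
        }
    }' "$1"
}

summarize samp_demo.txt
```

Adding `+ 0` forces a numeric `0` instead of an empty field for ids that never matched. If the number of distinct column-6 values is wanted instead, the `pass` counter would need to track the values it has already seen per id.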
Answer 1 (score: 1)
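There appears to be a typo in your input, where you put ...98 instead of ...99. Assuming that is the case, the rest of your information and the expected output make sense.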
An array is used to store the ids so that their original order is preserved.
use strict;
use warnings;
use feature 'say'; # to enable say()
my $hdr = <DATA>; # remove header
my %hash;
my @keys; # ids in first-seen order
while (<DATA>) {
    my ($id,$pos,$re,$va,$cn,$sf,$sr,$he,$ho,$nc) = split;
    $id =~ s/^c\K/m/; # c|... becomes cm|..., as in the expected output
    $hash{$id}{he} += $he;
    $hash{$id}{ho} += $ho;
    $hash{$id}{pass}{$sf}++;
    $hash{$id}{count}++;
    push @keys, $id if $hash{$id}{count} == 1;
}
say join "\t", qw(Id CountId Countpass CountHe CountHO);
for my $id (@keys) {
    say join "\t", $id,
        $hash{$id}{count}, # occurrences of id
        scalar keys %{ $hash{$id}{pass} }, # the number of unique passes
        @{$hash{$id}}{qw(he ho)};
}
__DATA__
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
Output:
Id CountId Countpass CountHe CountHO
cm|371443199 3 2 2 1
cm|367079424 2 2 0 2
cm|371443198 1 1 1 0
Note: I made the output tab-delimited to make post-processing easier. If you want it pretty, you can use printf to get some fixed-width fields.
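As a rough illustration of the fixed-width idea, the tab-delimited output can be aligned in a separate post-processing step. The sketch below does this with awk's `printf`, which accepts the same `%-14s` / `%8s` style conversions as Perl's `printf`. The column widths, the file name `summary_demo.tsv`, and the function name `align_columns` are assumptions chosen for this example.

```shell
# Hypothetical tab-delimited summary, shaped like the output shown above
printf 'Id\tCountId\tCountpass\tCountHe\tCountHO\n' >  summary_demo.tsv
printf 'cm|371443199\t3\t2\t2\t1\n'                >> summary_demo.tsv
printf 'cm|367079424\t2\t2\t0\t2\n'                >> summary_demo.tsv

# align_columns FILE: left-justify the Id, right-justify the counts
align_columns() {
    awk -F '\t' '{
        # widths are arbitrary; pick values at least as wide as the longest field
        printf "%-14s %8s %10s %8s %8s\n", $1, $2, $3, $4, $5
    }' "$1"
}

align_columns summary_demo.tsv
```

Keeping the script's own output tab-delimited and aligning it only for display means other tools can still parse the unformatted version.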