Output the whole line once for each unique value of a column, based on the value of a second column

Time: 2014-03-06 05:16:40

Tags: bash awk uniq

My question is very similar to one that was asked previously:

Output whole line once for each unique value of a column (Bash)

but with one major difference. In that question's example:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

the goal was to print one line for each distinct peptide value in column 2, which means the input above would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

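What I want is also one line for each unique entry in column 2, but the line printed should be the one with the highest value in column 3, so the output would look like this: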

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

Thanks in advance.

1 Answer:

Answer 0: (score: 1)

Here is one way:

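awk '
# Peptide already seen: replace the stored line only if column 3 is larger.
($2 in seen) {
    if ($3 + 0 > seen[$2] + 0) {
        seen[$2] = $3    # remember the new maximum, not just the first value
        line[$2] = $0
    }
    next
}
# First time this peptide appears: record its value and its line.
{
    seen[$2] = $3
    line[$2] = $0
}
END {
    # for-in visits keys in an unspecified order
    for (x in line) print line[x]
}' file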

Output (the line order can vary, because the for-in loop does not iterate in input order):

pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> KHEPPTEVDIEGR  5   genes ADUm.367
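
If the lines should come out in the order in which each peptide first appears, as in the desired output in the question, something like the following might work. This is only a minimal sketch: the function name max_per_key and the arrays max, order, and line are illustrative names of my own, and it assumes whitespace-separated fields with a numeric column 3. On ties it keeps the first line seen.

```shell
# Hypothetical helper: print, for each distinct value of column 2, the line
# with the largest numeric column 3, in order of first appearance.
max_per_key() {
    awk '
    # First occurrence of this key: remember its position in the input.
    !($2 in max) { order[++n] = $2 }
    # New key, or a strictly larger value than the best so far: store it.
    !($2 in max) || $3 + 0 > max[$2] {
        max[$2]  = $3 + 0
        line[$2] = $0
    }
    # Replay the stored lines in first-appearance order.
    END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$@"
}
```

Calling it as max_per_key file on the sample input should print the five lines in the same order as the desired output in the question. The cost is one extra array that holds the keys in input order.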